Method and device for video coding and decoding

ABSTRACT

There is disclosed a method for encoding at least two views of a video scene into a multiview video bitstream, where said views have different spatial resolutions. The method comprises prediction between pictures belonging to different views after resampling of one of these pictures. There is also disclosed a method for decoding a multiview video bitstream comprising at least two views having different spatial resolutions. The method comprises prediction between pictures belonging to different views after resampling of one of these pictures. There are also disclosed corresponding apparatuses and computer program products.

RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 61/405,159, filed Oct. 20, 2010, the content of which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

This invention relates to video coding and decoding. In particular, the present invention relates to the use of scalable video coding for different views of multiview video coding content.

BACKGROUND INFORMATION

This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.

In order to facilitate communication of video content over one or more networks, several coding standards have been developed. Video coding standards include ITU-T H.261, ISO/IEC MPEG-1 Video, ITU-T H.262 or ISO/IEC MPEG-2 Video, ITU-T H.263, ISO/IEC MPEG-4 Visual, ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), the scalable video coding (SVC) extension of H.264/AVC, and the multiview video coding (MVC) extension of H.264/AVC. In addition, there are currently efforts underway to develop new video coding standards.

In scalable video coding, a video signal can be encoded into a base layer and one or more enhancement layers. An enhancement layer enhances the temporal resolution (i.e., the frame rate), the spatial resolution, or simply the quality of the video content represented by another layer or part thereof. Each layer together with all its dependent layers is one representation of the video signal at a certain spatial resolution, temporal resolution and quality level. In this document, we refer to a scalable layer together with all of its dependent layers as a “scalable layer representation”. The portion of a scalable bitstream corresponding to a scalable layer representation can be extracted and decoded to produce a representation of the original signal at a certain fidelity.

Compressed multi-view video sequences require a considerable bitrate. They may have been coded for a spatial resolution (picture size in terms of pixels) or picture quality (spatial details) that are unnecessary for a display in use or unfeasible for a computational capacity in use while being suitable for another display and another computational complexity in use. In many systems, it would therefore be desirable to adjust the transmitted or processed bitrate, the picture rate, the picture size, or the picture quality of a compressed multi-view video bitstream. The current multi-view video coding solutions offer scalability only in terms of view scalability (selecting which views are decoded) or temporal scalability (selecting the picture rate at which the sequence is decoded).

The multi-view video profile of MPEG-2 video enables stereoscopic (2-view) video coded as if views were layers of a scalable MPEG-2 video bitstream, where a base layer is assigned to a left view and an enhancement layer is assigned to a right view. The multi-view video extension of H.264/AVC has been built on top of H.264/AVC, which provides only temporal scalability.

One branch of research in stereoscopic video compression is known as mixed-resolution (MR) stereoscopic video coding. In MR stereoscopic video, one of the two views is represented with a lower resolution compared to the other one, while, according to the binocular vision theory, it is assumed that the Human Visual System (HVS) fuses the two images such that the perceived quality is close to that of the higher quality view. In one study, the breakdown point where the higher-resolution view was no longer dominant in the perceived quality seemed to be between 11.4 and 7.6 pixels per degree of viewing angle.

Two asymmetric multiview video coding schemes have been presented: a quality asymmetry achieved with Medium Grain Scalability (MGS) or Fine Grain Scalability (FGS), and a spatially scalable mixed-resolution bitstream. In the latter scheme, equivalent layers in different views are of different resolutions and the equivalent layers have to be pruned jointly. For example, there are two views, view 0 and view 1, both having a base layer and one spatial enhancement layer. For view 0, the base layer is coded as VGA and the enhancement layer as 4VGA. For view 1, the base layer is coded as QVGA and the enhancement layer as VGA. The encoder uses asymmetric inter-view prediction between the views in both layers. That is, when the enhancement layer of view 1 is decoded, the decoded picture resulting from view 0 (both base and enhancement layers) is downsampled to be used as an inter-view reference. When the base layer of view 1 is decoded (i.e., the enhancement layer is removed from the bitstream), the decoded picture resulting from the base layer of view 0 is downsampled to be used as an inter-view reference. The encoder sets the pruning order indicator of the enhancement layers of both views to be the same. Consequently, a bitstream that results in decoding of both the base layer and the enhancement layer of view 0 and only the base layer of view 1 is not possible.

Reference Picture Resampling was specified as Annex P of ITU-T Recommendation H.263. The annex describes the use and syntax of a resampling process which can be applied to the previous decoded reference picture in order to generate a “warped” picture for use in predicting the current picture. This resampling syntax can specify the relationship of the current picture to a prior picture having a different source format, and can also specify a “global motion” warping alteration of the shape, size, and location of the prior picture with respect to the current picture. In particular, the Reference Picture Resampling mode can be used to adaptively alter the resolution of pictures during encoding. The Reference Picture Resampling mode can be invoked implicitly by the occurrence of a picture header for an INTER coded picture having a picture size which differs from that of the previous encoded picture.

SUMMARY

In one aspect, the invention relates to a method for encoding a first uncompressed picture of a first view and a second uncompressed picture of a second view into a bitstream. The method comprises:

encoding a first uncompressed picture;

reconstructing a first decoded picture on the basis of the encoding of the first uncompressed picture;

resampling at least a part of the first decoded picture into a first resampled decoded picture; and

encoding a second uncompressed picture as a first dependency representation and a second dependency representation,

wherein the first resampled decoded picture is used as a prediction reference for the encoding of the first dependency representation;

the first decoded picture is used as a prediction reference for the encoding of the second dependency representation; and

the first dependency representation is used in the encoding of the second dependency representation.

According to a second aspect there is provided an apparatus comprising:

an encoder configured for encoding the first uncompressed picture of a first view;

a reconstructor configured for reconstructing a first decoded picture on the basis of the encoding of the first uncompressed picture;

a sampler configured for resampling at least a part of the first decoded picture into a first resampled decoded picture; and

said encoder being further configured for

encoding a second uncompressed picture as a first dependency representation by using the first resampled decoded picture as a prediction reference, and

encoding a second dependency representation of a second view by using the first decoded picture as a prediction reference and the first dependency representation in the encoding of the second dependency representation.

According to a third aspect there is provided an apparatus comprising:

a processor; and

a memory unit operatively connected to the processor and including:

computer code configured to:

encode a first uncompressed picture of a first view;

reconstruct a first decoded picture on the basis of the encoding of the first uncompressed picture;

resample at least a part of the first decoded picture into a first resampled decoded picture; and

encode a second uncompressed picture of a second view as a first dependency representation and a second dependency representation,

wherein the first resampled decoded picture is used as a prediction reference for the encoding of the first dependency representation;

the first decoded picture is used as a prediction reference for the encoding of the second dependency representation; and

the first dependency representation is used in the encoding of the second dependency representation.

According to a fourth aspect there is provided a method for decoding a multiview video bitstream comprising a first view component of a first view and a second view component of a second view, the method comprising:

decoding the first view component into a first decoded picture;

determining a spatial resolution of the first view component and a spatial resolution of the second view component;

on the basis of the spatial resolution of the first view component being different from the spatial resolution of the second view component:

resampling at least a part of the first decoded picture into a first resampled decoded picture;

decoding the second view component using the first resampled decoded picture as a prediction reference.
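The decoding steps of this aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: pictures are plain lists of sample rows, `resample_nearest` is a hypothetical nearest-neighbour resampler standing in for whatever resampling filter a real codec would use, and `decode_second_view` is a placeholder that simply adds a residual to the inter-view reference.

```python
def resample_nearest(picture, out_width, out_height):
    """Resample a picture (list of rows) by nearest neighbour (assumed filter)."""
    in_height = len(picture)
    in_width = len(picture[0])
    return [
        [picture[y * in_height // out_height][x * in_width // out_width]
         for x in range(out_width)]
        for y in range(out_height)
    ]


def inter_view_reference(first_decoded, second_width, second_height):
    """Return the prediction reference for the second view component.

    The first decoded picture is resampled only when its spatial
    resolution differs from that of the second view component.
    """
    first_height = len(first_decoded)
    first_width = len(first_decoded[0])
    if (first_width, first_height) != (second_width, second_height):
        return resample_nearest(first_decoded, second_width, second_height)
    return first_decoded


def decode_second_view(residual, reference):
    """Placeholder decoding: prediction from the reference plus a residual."""
    return [
        [r + p for r, p in zip(res_row, ref_row)]
        for res_row, ref_row in zip(residual, reference)
    ]


# A 4x4 first view predicts a 2x2 second view.
first_decoded = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
reference = inter_view_reference(first_decoded, 2, 2)
second_decoded = decode_second_view([[1, 1], [1, 1]], reference)
```

When both view components share the same resolution, `inter_view_reference` returns the first decoded picture unchanged, which corresponds to conventional inter-view prediction.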

According to a fifth aspect there is provided an apparatus comprising:

a decoder configured for decoding a first view component of a first view into a first decoded picture;

a determining element configured for determining a spatial resolution of the first view component being different from a spatial resolution of a second view component of a second view;

a sampler configured for resampling at least a part of the first decoded picture into a first resampled decoded picture when the spatial resolution of the first view component differs from the spatial resolution of the second view component; and

said decoder being further configured for decoding the second view component using the first resampled decoded picture as a prediction reference.

According to a sixth aspect there is provided an apparatus comprising:

a processor; and

a memory unit operatively connected to the processor and including

computer code configured to:

decode the first view component of a first view into a first decoded picture;

determine a spatial resolution of the first view component and a spatial resolution of a second view component of a second view;

on the basis of the spatial resolution of the first view component being different from the spatial resolution of the second view component:

resample at least a part of the first decoded picture into a first resampled decoded picture;

decode the second view component using the first resampled decoded picture as a prediction reference.

According to a seventh aspect there is provided a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform:

encode a first uncompressed picture of a first view;

reconstruct a first decoded picture on the basis of the encoding of the first uncompressed picture;

resample at least a part of the first decoded picture into a first resampled decoded picture; and

encode a second uncompressed picture of a second view as a first dependency representation and a second dependency representation,

wherein the code, which when executed by a processor, further causes the apparatus to:

use the first resampled decoded picture as a prediction reference for the encoding of the first dependency representation;

use the first decoded picture as a prediction reference for the encoding of the second dependency representation; and

use the first dependency representation in the encoding of the second dependency representation.

According to an eighth aspect there is provided a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform:

decode a first view component of a first view into a first decoded picture;

determine a spatial resolution of the first view component and a spatial resolution of a second view component of a second view;

on the basis of the spatial resolution of the first view component being different from the spatial resolution of the second view component:

resample at least a part of the first decoded picture into a first resampled decoded picture;

decode the second view component using the first resampled decoded picture as a prediction reference.

According to a ninth aspect there is provided at least one processor and at least one memory, said at least one memory stored with code thereon, which when executed by said at least one processor, causes an apparatus to perform:

encode the first uncompressed picture of a first view;

reconstruct a first decoded picture on the basis of the encoding of the first uncompressed picture;

resample at least a part of the first decoded picture into a first resampled decoded picture; and

encode a second uncompressed picture of a second view as a first dependency representation and a second dependency representation,

wherein the code, which when executed by a processor, further causes the apparatus to:

use the first resampled decoded picture as a prediction reference for the encoding of the first dependency representation;

use the first decoded picture as a prediction reference for the encoding of the second dependency representation; and

use the first dependency representation in the encoding of the second dependency representation.

According to a tenth aspect there is provided at least one processor and at least one memory, said at least one memory stored with code thereon, which when executed by said at least one processor, causes an apparatus to perform:

decode a first view component of a first view into a first decoded picture;

determine a spatial resolution of the first view component and a spatial resolution of a second view component of a second view;

on the basis of the spatial resolution of the first view component being different from the spatial resolution of the second view component:

resample at least a part of the first decoded picture into a first resampled decoded picture;

decode the second view component using the first resampled decoded picture as a prediction reference.

According to an eleventh aspect there is provided an apparatus comprising:

means for encoding a first uncompressed picture of a first view;

means for reconstructing a first decoded picture on the basis of the encoding of the first uncompressed picture;

means for resampling at least a part of the first decoded picture into a first resampled decoded picture; and

means for encoding a second uncompressed picture of a second view as a first dependency representation by using the first resampled decoded picture as a prediction reference, and

means for encoding a second dependency representation by using the first decoded picture as a prediction reference and the first dependency representation in the encoding of the second dependency representation.

According to a twelfth aspect there is provided an apparatus comprising:

means for decoding a first view component of a first view into a first decoded picture;

means for determining a spatial resolution of the first view component being different from a spatial resolution of a second view component of a second view;

means for resampling at least a part of the first decoded picture into a first resampled decoded picture when the spatial resolution of the first view component differs from the spatial resolution of the second view component; and

means for decoding the second view component using the first resampled decoded picture as a prediction reference.

In some embodiments a scalable coding of multiview video bitstreams is implemented in such a manner that scalable layers can be pruned unevenly between views. For example, the base view may be non-scalably coded, while the non-base view is spatially scalably coded. The inter-view prediction from the base view is adapted on the basis of which scalable layers are present in the non-base view.

The capability or preference of receivers to decode full-resolution or mixed-resolution video may not be known at the time of encoding, or there may be receivers of both types receiving the same bitstream. A full-resolution symmetric stereo video bitstream may be adapted in a gateway to become a mixed-resolution bitstream to meet the receiver's capabilities/preferences and/or downlink network throughput.

Services or transmission schemes falling under these constraints include the following:

Multiparty video conferencing with heterogeneous receivers or network capability. A multipoint conference control unit (MCU) adapts the bitstream according to downlink throughput and/or receiver capabilities/preferences.

IP multicast. The base and enhancement layers of the non-base view are transmitted in distinct multicast groups, and receivers may subscribe to only the base layer or both layers.

Application-layer multicast (a.k.a. peer-to-peer streaming). Each relay node forwards the bitstream according to downlink throughput and/or receiver capabilities/preferences.

Broadcast. Some receivers might decode mixed-resolution stereo video as opposed to full-resolution symmetric stereo video in order to save computational resources.

Local file playback. At the time of generating the file, the computational capability of the player device is not known.

In some embodiments the receiver's preference for receiving a mixed-resolution stereo video bitstream may be based on the analysis of viewer distance from the display.

DESCRIPTION OF THE DRAWINGS

Embodiments of the invention are described by referring to the attached drawings, in which:

FIG. 1 illustrates an exemplary hierarchical coding structure with temporal scalability;

FIG. 2 illustrates an exemplary MVC decoding order;

FIG. 3 illustrates an exemplary MVC prediction structure for multi-view video coding;

FIG. 4 is an overview diagram of a system within which various embodiments of the present invention may be implemented;

FIG. 5 illustrates a perspective view of an exemplary electronic device which may be utilized in accordance with the various embodiments of the present invention;

FIG. 6 is a schematic representation of the circuitry which may be included in the electronic device of FIG. 5;

FIG. 7 is a graphical representation of a generic multimedia communication system within which various embodiments may be implemented;

FIG. 8 illustrates an example of a scalable stereoscopic coding scheme enabling bitstream pruning to a mixed-resolution stereoscopic video;

FIG. 9 illustrates a modified inter-view prediction when encoding or decoding mixed-resolution stereoscopic video;

FIG. 10 is a flow diagram of an encoding method according to an example embodiment of the present invention;

FIG. 11 is a flow diagram of a decoding method according to an example embodiment of the present invention; and

FIG. 12 is a schematic representation of a converter according to an example embodiment of the present invention.

DETAILED DESCRIPTION OF SOME EXAMPLE EMBODIMENTS

In the following description, for purposes of explanation and not limitation, details and descriptions are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to those skilled in the art that the present invention may be practiced in other embodiments that depart from these details and descriptions.

As noted above, in scalable video coding, a video signal can be encoded into a base layer and one or more enhancement layers. An enhancement layer enhances the temporal resolution (i.e., the frame rate), the spatial resolution, or simply the quality of the video content represented by another layer or part thereof. Each layer together with all its dependent layers is one representation of the video signal at a certain spatial resolution, temporal resolution and quality level. In this document, we refer to a scalable layer together with all of its dependent layers as a “scalable layer representation”. The portion of a scalable bitstream corresponding to a scalable layer representation can be extracted and decoded to produce a representation of the original signal at a certain fidelity.

In some cases, data in an enhancement layer can be truncated after a certain location, or even at arbitrary positions, where each truncation position may include additional data representing increasingly enhanced visual quality. Such scalability is referred to as fine-grained (granularity) scalability (FGS). FGS was included in some draft versions of the SVC standard, but it was eventually excluded from the final SVC standard. FGS is subsequently discussed in the context of some draft versions of the SVC standard. The scalability provided by those enhancement layers that cannot be truncated is referred to as coarse-grained (granularity) scalability (CGS). It collectively includes the traditional quality (SNR) scalability and spatial scalability. The SVC standard supports the so-called medium-grained scalability (MGS), where quality enhancement pictures are coded similarly to SNR scalable layer pictures but indicated by high-level syntax elements similarly to FGS layer pictures, by having the quality_id syntax element greater than 0.

SVC uses an inter-layer prediction mechanism, wherein certain information can be predicted from layers other than the currently reconstructed layer or the next lower layer. Information that could be inter-layer predicted includes intra texture, motion and residual data. Inter-layer motion prediction includes the prediction of block coding mode, header information, etc., wherein motion from the lower layer may be used for prediction of the higher layer. In case of intra coding, a prediction from surrounding macroblocks or from co-located macroblocks of lower layers is possible. These prediction techniques do not employ information from earlier coded access units and hence, are referred to as intra prediction techniques. Furthermore, residual data from lower layers can also be employed for prediction of the current layer.

SVC specifies a concept known as single-loop decoding. It is enabled by using a constrained intra texture prediction mode, whereby the inter-layer intra texture prediction can be applied to macroblocks (MBs) for which the corresponding block of the base layer is located inside intra-MBs. At the same time, those intra-MBs in the base layer use constrained intra-prediction (e.g., having the syntax element “constrained_intra_pred_flag” equal to 1). In single-loop decoding, the decoder performs motion compensation and full picture reconstruction only for the scalable layer desired for playback (called the “desired layer” or the “target layer”), thereby greatly reducing decoding complexity. All of the layers other than the desired layer do not need to be fully decoded because all or part of the data of the MBs not used for inter-layer prediction (be it inter-layer intra texture prediction, inter-layer motion prediction or inter-layer residual prediction) is not needed for reconstruction of the desired layer.

A single decoding loop is needed for decoding of most pictures, while a second decoding loop is selectively applied to reconstruct the base representations, which are needed as prediction references but not for output or display, and are reconstructed only for the so-called key pictures (for which “store_ref_base_pic_flag” is equal to 1).

The scalability structure in the SVC draft is characterized by three syntax elements: “temporal_id,” “dependency_id” and “quality_id.” The syntax element “temporal_id” is used to indicate the temporal scalability hierarchy or, indirectly, the frame rate. A scalable layer representation comprising pictures of a smaller maximum “temporal_id” value has a smaller frame rate than a scalable layer representation comprising pictures of a greater maximum “temporal_id”. A given temporal layer typically depends on the lower temporal layers (i.e., the temporal layers with smaller “temporal_id” values) but does not depend on any higher temporal layer. The syntax element “dependency_id” is used to indicate the CGS inter-layer coding dependency hierarchy (which, as mentioned earlier, includes both SNR and spatial scalability). At any temporal level location, a picture of a smaller “dependency_id” value may be used for inter-layer prediction for coding of a picture with a greater “dependency_id” value. The syntax element “quality_id” is used to indicate the quality level hierarchy of a FGS or MGS layer. At any temporal location, and with an identical “dependency_id” value, a picture with “quality_id” equal to QL uses the picture with “quality_id” equal to QL−1 for inter-layer prediction. A coded slice with “quality_id” larger than 0 may be coded as either a truncatable FGS slice or a non-truncatable MGS slice.
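The role of the three syntax elements in selecting a scalable layer representation can be illustrated with a small filter over NAL units. This is a simplified sketch, not the normative SVC sub-bitstream extraction process: the `NalUnit` record and the function name are hypothetical, and the rule that lower dependency layers keep all of their quality layers is an assumption made for illustration.

```python
from collections import namedtuple

# Hypothetical record holding only the three scalability identifiers and a payload.
NalUnit = namedtuple("NalUnit", "temporal_id dependency_id quality_id payload")


def extract_layer_representation(nal_units, max_temporal_id,
                                 target_dependency_id, max_quality_id):
    """Keep the NAL units needed for one scalable layer representation."""
    kept = []
    for nal in nal_units:
        # Higher temporal layers are never needed by lower ones.
        if nal.temporal_id > max_temporal_id:
            continue
        # Layers above the target dependency layer are discarded.
        if nal.dependency_id > target_dependency_id:
            continue
        # Within the target dependency layer, cap the quality level.
        if (nal.dependency_id == target_dependency_id
                and nal.quality_id > max_quality_id):
            continue
        kept.append(nal)
    return kept


units = [
    NalUnit(0, 0, 0, "a"), NalUnit(0, 0, 1, "b"),
    NalUnit(0, 1, 0, "c"), NalUnit(0, 1, 1, "d"),
    NalUnit(1, 0, 0, "e"), NalUnit(1, 1, 0, "f"),
]
low_rate = extract_layer_representation(units, 0, 1, 0)
base_only = extract_layer_representation(units, 1, 0, 0)
```

Here `low_rate` keeps the lowest temporal layer up to dependency layer 1 at base quality, while `base_only` keeps both temporal layers of dependency layer 0 at base quality.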

For simplicity, all the data units (e.g., Network Abstraction Layer units or NAL units in the SVC context) in one access unit having an identical value of “dependency_id” are referred to as a dependency unit or a dependency representation. Within one dependency unit, all the data units having an identical value of “quality_id” are referred to as a quality unit or layer representation.
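The grouping described above can be sketched as a two-level mapping. The tuple layout `(dependency_id, quality_id, payload)` and the function name are assumptions made for this illustration only.

```python
from collections import defaultdict


def group_access_unit(nal_units):
    """Group the NAL units of one access unit.

    nal_units: iterable of (dependency_id, quality_id, payload) tuples.
    Returns {dependency_id: {quality_id: [payload, ...]}}, i.e. dependency
    units at the outer level and quality units (layer representations)
    at the inner level.
    """
    dependency_units = defaultdict(lambda: defaultdict(list))
    for dependency_id, quality_id, payload in nal_units:
        dependency_units[dependency_id][quality_id].append(payload)
    # Convert to plain dicts so the result compares and prints cleanly.
    return {d: dict(q) for d, q in dependency_units.items()}


access_unit = [(0, 0, "a"), (0, 0, "b"), (0, 1, "c"), (1, 0, "d")]
grouped = group_access_unit(access_unit)
```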

A base representation, also known as a decoded base picture, is a decoded picture resulting from decoding the Video Coding Layer (VCL) NAL units of a dependency unit having “quality_id” equal to 0 and for which the “store_ref_base_pic_flag” is set equal to 1. An enhancement representation, also referred to as a decoded picture, results from the regular decoding process in which all the layer representations that are present for the highest dependency representation are decoded.

Each H.264/AVC VCL NAL unit (with NAL unit type in the range of 1 to 5) is preceded by a prefix NAL unit in an SVC bitstream. A compliant H.264/AVC decoder implementation ignores prefix NAL units. The prefix NAL unit includes the “temporal_id” value and hence an SVC decoder that decodes the base layer can learn the temporal scalability hierarchy from the prefix NAL units. Moreover, the prefix NAL unit includes reference picture marking commands for base representations.

SVC uses the same mechanism as H.264/AVC to provide temporal scalability. Temporal scalability provides refinement of the video quality in the temporal domain, by giving flexibility of adjusting the frame rate. A review of temporal scalability is provided in the subsequent paragraphs.

The earliest scalability introduced to video coding standards was temporal scalability with B pictures in MPEG-1 Visual. In this B picture concept, a B picture is bi-predicted from two pictures, one preceding the B picture and the other succeeding the B picture, both in display order. In bi-prediction, two prediction blocks from two reference pictures are averaged sample-wise to get the final prediction block. Conventionally, a B picture is a non-reference picture (i.e., it is not used for inter picture prediction reference by other pictures). Consequently, the B pictures could be discarded to achieve a temporal scalability point with a lower frame rate. The same mechanism was retained in MPEG-2 Video, H.263 and MPEG-4 Visual.
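The sample-wise averaging used in bi-prediction can be illustrated with a few lines of code. This is a sketch only: blocks are plain lists of integer sample rows, and the round-half-up integer rounding is an assumption chosen for the example rather than a rule taken from any particular standard.

```python
def bi_predict(block0, block1):
    """Average two prediction blocks sample by sample.

    The +1 before the shift rounds halves upward (an assumed rounding rule).
    """
    return [
        [(a + b + 1) >> 1 for a, b in zip(row0, row1)]
        for row0, row1 in zip(block0, block1)
    ]


# One block from the preceding picture and one from the succeeding picture.
preceding = [[10, 20], [30, 40]]
succeeding = [[20, 21], [30, 44]]
prediction = bi_predict(preceding, succeeding)
```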

In H.264/AVC, the concept of B pictures or B slices has been changed. The definition of B slice is as follows: A slice that may be decoded using intra prediction from decoded samples within the same slice or inter prediction from previously decoded reference pictures, using at most two motion vectors and reference indices to predict the sample values of each block. Both the bi-directional prediction property and the non-reference picture property of the conventional B picture concept are no longer valid. A block in a B slice may be predicted from two reference pictures in the same direction in display order, and a picture consisting of B slices may be referred to by other pictures for inter-picture prediction.

In H.264/AVC, SVC and MVC, temporal scalability can be achieved by using non-reference pictures and/or a hierarchical inter-picture prediction structure. Using only non-reference pictures achieves temporal scalability similar to that of conventional B pictures in MPEG-1/2/4, by discarding non-reference pictures. A hierarchical coding structure can achieve more flexible temporal scalability.

Referring now to FIG. 1, an exemplary hierarchical coding structure is illustrated with four levels of temporal scalability. The display order is indicated by the values denoted as picture order count (POC) 210. The I or P pictures, such as I/P picture 212, also referred to as key pictures, are coded as the first picture of a group of pictures (GOP) 214 in decoding order. When a key picture (e.g., key picture 216, 218) is inter-coded, the previous key pictures 212, 216 are used as reference for inter-picture prediction. These pictures correspond to the lowest temporal level 220 (denoted as TL in the figure) in the temporal scalable structure and are associated with the lowest frame rate. Pictures of a higher temporal level may only use pictures of the same or lower temporal level for inter-picture prediction. With such a hierarchical coding structure, different temporal scalability corresponding to different frame rates can be achieved by discarding pictures of a certain temporal level value and beyond. In FIG. 1, the pictures 0, 8 and 16 are of the lowest temporal level, while the pictures 1, 3, 5, 7, 9, 11, 13 and 15 are of the highest temporal level. Other pictures are assigned other temporal levels hierarchically. These pictures of different temporal levels compose bitstreams of different frame rates. When decoding all the temporal levels, a frame rate of 30 Hz is obtained. Other frame rates can be obtained by discarding pictures of some temporal levels. The pictures of the lowest temporal level are associated with the frame rate of 3.75 Hz (one picture in eight at 30 Hz). A temporal scalable layer with a lower temporal level or a lower frame rate is also called a lower temporal layer.
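The assignment of temporal levels in a dyadic hierarchy such as that of FIG. 1 can be sketched as follows. The function names are hypothetical, and the sketch assumes a GOP size that is a power of two (eight pictures for four temporal levels) and a full frame rate of 30 Hz; under those assumptions the lowest temporal level alone yields 30/8 = 3.75 Hz.

```python
def temporal_level(poc, gop_size=8):
    """Temporal level of a picture in a dyadic hierarchy (gop_size a power of two)."""
    num_levels = gop_size.bit_length() - 1      # log2(gop_size), e.g. 3 for 8
    offset = poc % gop_size
    if offset == 0:
        return 0                                 # key picture, lowest level
    # Number of trailing zero bits of the offset within the GOP.
    trailing_zeros = (offset & -offset).bit_length() - 1
    return num_levels - trailing_zeros


def frame_rate(max_level, full_rate=30.0, gop_size=8):
    """Frame rate obtained when only levels 0..max_level are decoded."""
    num_levels = gop_size.bit_length() - 1
    return full_rate / (2 ** (num_levels - max_level))


levels = [temporal_level(poc) for poc in range(17)]
rates = [frame_rate(level) for level in range(4)]
```

With these assumptions, pictures 0, 8 and 16 fall on level 0 and all odd pictures on level 3, matching the description of FIG. 1, and `rates` lists the frame rate reached at each temporal level.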

The above-described hierarchical B picture coding structure is the most typical coding structure for temporal scalability. However, it is noted that much more flexible coding structures are possible. For example, the GOP size may not be constant over time. In another example, the temporal enhancement layer pictures do not have to be coded as B slices; they may also be coded as P slices.

In H.264/AVC, the temporal level may be signaled by the sub-sequence layer number in the sub-sequence information Supplemental Enhancement Information (SEI) messages. In SVC, the temporal level is signaled in the Network Abstraction Layer (NAL) unit header by the syntax element “temporal_id.” The bitrate and frame rate information for each temporal level is signaled in the scalability information SEI message.

As mentioned earlier, CGS includes both spatial scalability and SNR scalability. Spatial scalability was initially designed to support representations of video with different resolutions. For each time instance, VCL NAL units are coded in the same access unit and these VCL NAL units can correspond to different resolutions. During the decoding, a low resolution VCL NAL unit provides the motion field and residual which can be optionally inherited by the final decoding and reconstruction of the high resolution picture. When compared to older video compression standards, SVC's spatial scalability has been generalized to enable the base layer to be a cropped and zoomed version of the enhancement layer.

MGS quality layers are indicated with “quality_id” similarly to FGS quality layers. For each dependency unit (with the same “dependency_id”), there is a layer with “quality_id” equal to 0 and there can be other layers with “quality_id” greater than 0. These layers with “quality_id” greater than 0 are either MGS layers or FGS layers, depending on whether the slices are coded as truncatable slices.

In the basic form of FGS enhancement layers, only inter-layer prediction is used. Therefore, FGS enhancement layers can be truncated freely without causing any error propagation in the decoded sequence. However, the basic form of FGS suffers from low compression efficiency. This issue arises because only low-quality pictures are used for inter prediction references. It has therefore been proposed that FGS-enhanced pictures be used as inter prediction references. However, this causes encoding-decoding mismatch, also referred to as drift, when some FGS data are discarded.

One feature of SVC is that the FGS NAL units can be freely dropped or truncated, and MGS NAL units can be freely dropped (but cannot be truncated) without affecting the conformance of the bitstream. As discussed above, when those FGS or MGS data have been used for inter prediction reference during encoding, dropping or truncation of the data would result in a mismatch between the decoded pictures on the decoder side and on the encoder side. This mismatch is also referred to as drift.

To control drift due to the dropping or truncation of FGS or MGS data, SVC applied the following solution: In a certain dependency unit, a base representation (by decoding only the CGS picture with “quality_id” equal to 0 and all the dependent-on lower layer data) is stored in the decoded picture buffer. When encoding a subsequent dependency unit with the same value of “dependency_id,” all of the NAL units, including FGS or MGS NAL units, use the base representation for inter prediction reference. Consequently, all drift due to dropping or truncation of FGS or MGS NAL units in an earlier access unit is stopped at this access unit. For other dependency units with the same value of “dependency_id,” all of the NAL units use the decoded pictures for inter prediction reference, for high coding efficiency.

Each NAL unit includes in the NAL unit header a syntax element “use_ref_base_pic_flag.” When the value of this element is equal to 1, decoding of the NAL unit uses the base representations of the reference pictures during the inter prediction process. The syntax element “store_ref_base_pic_flag” specifies whether (when equal to 1) or not (when equal to 0) to store the base representation of the current picture for future pictures to use for inter prediction.

NAL units with “quality_id” greater than 0 do not contain syntax elements related to reference picture list construction and weighted prediction, i.e., the syntax elements “num_ref_active_lx_minus1” (x=0 or 1), the reference picture list reordering syntax table, and the weighted prediction syntax table are not present. Consequently, the MGS or FGS layers have to inherit these syntax elements from the NAL units with “quality_id” equal to 0 of the same dependency unit when needed.

The leaky prediction technique makes use of both base representations and decoded pictures (corresponding to the highest decoded “quality_id”), by predicting FGS data using a weighted combination of the base representations and decoded pictures. The weighting factor can be used to control the attenuation of the potential drift in the enhancement layer pictures. More information on leaky prediction can be found in H. C. Huang, C. N. Wang, and T. Chiang, “A robust fine granularity scalability using trellis-based predictive leak,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 6, pp. 372-385, June 2002.
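Purely as an illustrative sketch, the weighted combination can be written as a per-sample blend of two co-located reference blocks. The linear form, the parameter name alpha and the function name leaky_prediction are assumptions made for this example; they are not taken from the cited reference or from the SVC specification.

```python
def leaky_prediction(base_ref, enhanced_ref, alpha):
    """Blend a base-representation block with an enhanced decoded block.

    alpha = 0.0 uses only the base representation (no drift can propagate);
    alpha = 1.0 uses only the enhanced decoded picture (highest efficiency).
    The linear blend is an assumed formulation for illustration.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [(1.0 - alpha) * b + alpha * e for b, e in zip(base_row, enh_row)]
        for base_row, enh_row in zip(base_ref, enhanced_ref)
    ]


base = [[100, 100]]
enhanced = [[110, 90]]
print(leaky_prediction(base, enhanced, 0.5))
```

A smaller weighting factor attenuates more quickly any mismatch carried by the enhanced reference, at the cost of a less accurate prediction.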

When leaky prediction is used, the FGS feature of the SVC is often referred to as Adaptive Reference FGS (AR-FGS). AR-FGS is a tool to balance between coding efficiency and drift control. AR-FGS enables leaky prediction by slice level signaling and MB level adaptation of weighting factors. More details of a mature version of AR-FGS can be found in JVT-W119: Yiliang Bao, Marta Karczewicz, Yan Ye, “CE1 report: FGS simplification,” JVT-W119, 23rd JVT meeting, San Jose, USA, April 2007, available at ftp3.itu.ch/av-arch/jvt-site/2007_04_SanJose/JVT-W119.zip.

A value of picture order count (POC) is derived for each picture and is non-decreasing with increasing picture position in output order relative to the previous IDR picture or a picture containing a memory management control operation marking all pictures as “unused for reference.” POC therefore indicates the output order of pictures. POC is also used in the decoding process for implicit scaling of motion vectors in the temporal direct mode of bi-predictive slices, for implicitly derived weights in weighted prediction, and for reference picture list initialization of B slices. Furthermore, POC is used in verification of output order conformance.

Values of POC can be coded with one of the three modes signaled in the active sequence parameter set. In the first mode, the selected number of least significant bits of the POC value is included in each slice header. It may be beneficial to use the first mode when the decoding and output order of pictures differs and the picture rate varies. In the second mode, the relative increments of POC as a function of the picture position in decoding order in the coded video sequence are coded in the sequence parameter set. In addition, deviations from the POC value derived from the sequence parameter set may be indicated in slice headers. The second mode suits bitstreams in which the decoding and output order of pictures differs and the picture rate stays exactly or close to unchanged. In the third mode, the value of POC is derived from the decoding order by assuming that the decoding and output order are identical. In addition, only one non-reference picture can occur consecutively when the third mode is used.

The reference picture list construction in AVC can be described as follows. When multiple reference pictures can be used, each reference picture must be identified. In AVC, the identification of a reference picture used for a coded block is as follows. First, each of the reference pictures stored in the DPB for prediction reference of future pictures is marked either as “used for short-term reference” (referred to as short-term pictures) or as “used for long-term reference” (referred to as long-term pictures). When decoding a coded slice, a reference picture list is constructed. If the coded slice is a bi-predicted slice, a second reference picture list is also constructed. A reference picture used for a coded block is then identified by the index of the used reference picture in the reference picture list. The index is coded in the bitstream when more than one reference picture may be used.

The reference picture list construction process is as follows. For simplicity, it is assumed herein that only one reference picture list is needed. First, an initial reference picture list is constructed including all of the short-term and long-term reference pictures. Reference picture list reordering (RPLR) is then performed when the slice header contains RPLR commands. The RPLR process may reorder the reference pictures into a different order than the order in the initial list. Both the initial list and the final list, after reordering, contain only a certain number of entries indicated by a syntax element in the slice header or the picture parameter set referred to by the slice.

During the initialization process, all of the short-term and long-term pictures are considered as candidates for the reference picture lists of the current picture. Regardless of whether the current picture is a B or P picture, long-term pictures are placed after the short-term pictures in RefPicList0 (and in RefPicList1, which is available for B slices). For P pictures, the initial reference picture list for RefPicList0 contains all short-term reference pictures ordered in descending order of PicNum.

For B pictures, the reference pictures obtained from all short-term pictures are ordered by a rule related to the current POC number and the POC number of the reference picture. For RefPicList0, reference pictures with smaller POC (compared to the current POC) are considered first and inserted into RefPicList0 in descending order of POC. Pictures with larger POC are then appended in ascending order of POC. For RefPicList1 (if available), reference pictures with larger POC (compared to the current POC) are considered first and inserted into RefPicList1 in ascending order of POC. Pictures with smaller POC are then appended in descending order of POC. After considering all of the short-term reference pictures, the long-term reference pictures are appended in ascending order of LongTermPicNum, both for P and B pictures.
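The initialization rules described above can be sketched as follows. This is a simplified illustration for frame coding only; the RefPic record and the function names init_list_p and init_lists_b are hypothetical, and a long-term picture is modeled here simply as an entry whose long_term_pic_num is set.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RefPic:
    poc: int
    pic_num: int = 0
    long_term_pic_num: Optional[int] = None  # None means short-term picture


def _split(refs: List[RefPic]) -> Tuple[List[RefPic], List[RefPic]]:
    short = [r for r in refs if r.long_term_pic_num is None]
    long_ = sorted((r for r in refs if r.long_term_pic_num is not None),
                   key=lambda r: r.long_term_pic_num)  # ascending LongTermPicNum
    return short, long_


def init_list_p(refs: List[RefPic]) -> List[RefPic]:
    """Initial RefPicList0 for a P picture: descending PicNum, then long-term."""
    short, long_ = _split(refs)
    return sorted(short, key=lambda r: r.pic_num, reverse=True) + long_


def init_lists_b(refs: List[RefPic], cur_poc: int):
    """Initial RefPicList0 and RefPicList1 for a B picture, ordered by POC."""
    short, long_ = _split(refs)
    before = sorted((r for r in short if r.poc < cur_poc),
                    key=lambda r: r.poc, reverse=True)  # descending POC
    after = sorted((r for r in short if r.poc > cur_poc),
                   key=lambda r: r.poc)                 # ascending POC
    return before + after + long_, after + before + long_
```

For example, with short-term reference pictures at POC 0, 4 and 8 and a current picture at POC 2, the sketch places POC 0 first in RefPicList0 and POC 4 first in RefPicList1.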

The reordering process is invoked by consecutive RPLR commands, of which there are four types: (1) a command to specify a short-term picture with smaller PicNum (compared to a temporally predicted PicNum) to be moved; (2) a command to specify a short-term picture with larger PicNum to be moved; (3) a command to specify a long-term picture with a certain LongTermPicNum to be moved; and (4) a command to specify the end of the RPLR loop. If a current picture is bi-predicted, there are two loops: one for the forward reference list and one for the backward reference list.

The predicted PicNum, referred to as picNumLXPred, is initialized as the PicNum of the current coded picture and is set to the PicNum of the just moved picture after each reordering process for a short-term picture. The difference between the PicNum of the picture being reordered and picNumLXPred is signaled in the RPLR command. The picture indicated to be reordered is moved to the beginning of the reference picture list. After the reordering process is complete, the whole reference picture list is truncated based on the active reference picture list size, which is num_ref_idx_lX_active_minus1+1 (X equal to 0 or 1 corresponds to RefPicList0 and RefPicList1, respectively).
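A simplified sketch of the reordering loop is given below. It makes several assumptions that are not stated in the text above: pictures are modeled as hypothetical tuples ("S", PicNum) or ("L", LongTermPicNum), commands as hypothetical (operation, value) pairs, modular wrapping of PicNum is ignored, and successive commands place their pictures at successive positions counted from the start of the list.

```python
def apply_rplr(initial_list, commands, cur_pic_num, active_size):
    """Apply RPLR-style commands to a copy of initial_list and truncate it.

    commands: sequence of ("sub", diff), ("add", diff), ("long", num), ("end", 0).
    """
    ref_list = list(initial_list)
    pic_num_pred = cur_pic_num  # picNumLXPred starts at the current PicNum
    insert_at = 0               # assumed: each moved picture takes the next slot
    for op, value in commands:
        if op == "end":
            break
        if op == "long":
            target = ("L", value)
        else:
            # Short-term picture addressed by a difference to the prediction.
            pic_num_pred += -value if op == "sub" else value
            target = ("S", pic_num_pred)
        ref_list.remove(target)  # raises ValueError if the picture is absent
        ref_list.insert(insert_at, target)
        insert_at += 1
    return ref_list[:active_size]


initial = [("S", 7), ("S", 6), ("S", 5), ("L", 0)]
print(apply_rplr(initial, [("sub", 3), ("end", 0)], cur_pic_num=8, active_size=3))
```

In the usage example, the single command addresses the picture with PicNum 8 − 3 = 5, which is moved in front of the other entries before the list is truncated to three entries.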

In SVC, a reference picture list consists of either only base representations (when “use_ref_base_pic_flag” is equal to 1) or only decoded pictures not marked as “base representation” (when “use_ref_base_pic_flag” is equal to 0), but never both at the same time.

In terms of reference picture marking, decoded pictures used for predicting subsequent coded pictures and for future output are buffered in the decoded picture buffer (DPB). To efficiently utilize the buffer memory, the DPB management processes, including the storage process of decoded pictures into the DPB, the marking process of reference pictures, and the output and removal processes of decoded pictures from the DPB, are specified.

The process for reference picture marking in AVC is summarized as follows. The maximum number of reference pictures used for inter prediction, referred to as M, is indicated in the active sequence parameter set. When a reference picture is decoded, it is marked as “used for reference.” If the decoding of the reference picture causes more than M pictures to be marked as “used for reference,” at least one picture must be marked as “unused for reference.” The DPB removal process then removes pictures marked as “unused for reference” from the DPB if they are not needed for output as well.

There are two types of operation for the reference picture marking: adaptive memory control and sliding window. The operation mode for reference picture marking is selected on a picture basis. The adaptive memory control requires the presence of memory management control operation (MMCO) commands in the bitstream. The memory management control operations enable explicit signaling of which pictures are marked as “unused for reference,” assigning long-term frame indices to short-term reference pictures, storage of the current picture as a long-term picture, changing a short-term picture to a long-term picture, and assigning the maximum allowed long-term frame index (MaxLongTermFrameIdx) for long-term pictures. If the sliding window operation mode is in use and there are M pictures marked as “used for reference,” then the short-term reference picture that was the first decoded picture among those short-term reference pictures that are marked as “used for reference” is marked as “unused for reference.” In other words, the sliding window operation mode results in first-in-first-out buffering operation among short-term reference pictures.
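The sliding window operation mode can be sketched as a first-in-first-out queue of short-term reference pictures. The function name sliding_window_mark and its parameters are hypothetical; long-term pictures are represented only by their count, since the sliding window never removes them.

```python
from collections import deque


def sliding_window_mark(short_term: deque, num_long_term: int, new_pic, max_refs: int):
    """Add new_pic as a short-term reference, evicting the oldest ones if needed.

    short_term holds the short-term reference pictures in decoding order.
    Returns the pictures that become marked as "unused for reference".
    """
    newly_unused = []
    # Make room so that at most max_refs (M) pictures remain marked afterwards.
    while short_term and len(short_term) + num_long_term >= max_refs:
        newly_unused.append(short_term.popleft())
    short_term.append(new_pic)
    return newly_unused


window = deque([0, 1, 2])
print(sliding_window_mark(window, num_long_term=0, new_pic=3, max_refs=3), list(window))
```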

Each short-term picture is associated with a variable PicNum that is derived from the syntax element “frame_num,” and each long-term picture is associated with a variable LongTermPicNum that is derived from the “long_term_frame_idx” which is signaled by an MMCO command.

PicNum is derived from FrameNumWrap depending on whether a frame or a field is coded or decoded. For frames, PicNum is equal to FrameNumWrap. FrameNumWrap is derived from FrameNum, and FrameNum is derived from frame_num. For example, in AVC frame coding, FrameNum of a short-term reference picture is assigned the same value as the frame_num of that picture, and FrameNumWrap is defined as follows, where frame_num is that of the current picture: if (FrameNum > frame_num) FrameNumWrap = FrameNum − MaxFrameNum; else FrameNumWrap = FrameNum.
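The derivation for frame coding can be illustrated with the following sketch. The function name frame_num_wrap and its parameter names are hypothetical; ref_frame_num stands for the FrameNum of a short-term reference frame and cur_frame_num for the frame_num of the current picture.

```python
def frame_num_wrap(ref_frame_num: int, cur_frame_num: int, max_frame_num: int) -> int:
    """FrameNumWrap of a short-term reference frame (equal to PicNum for frames)."""
    if ref_frame_num > cur_frame_num:
        # The reference was decoded before frame_num wrapped around.
        return ref_frame_num - max_frame_num
    return ref_frame_num


# With MaxFrameNum = 16 and a current frame_num of 1, a reference frame with
# FrameNum 15 precedes the wrap-around and receives a negative PicNum.
print(frame_num_wrap(15, 1, 16), frame_num_wrap(0, 1, 16))
```

The wrap-around keeps PicNum decreasing with decoding age, which is what the descending PicNum ordering of the initial list for P pictures relies on.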

LongTermPicNum is derived from the long-term frame index (LongTermFrameIdx) assigned for the picture. For frames, LongTermPicNum is equal to LongTermFrameIdx.

“frame_num” is a syntax element in each slice header. The value of “frame_num” for a frame or a complementary field pair essentially increments by one, in modulo arithmetic, relative to the “frame_num” of the previous reference frame or reference complementary field pair. In IDR pictures, the value of “frame_num” is zero. For pictures containing a memory management control operation marking all pictures as “unused for reference,” the value of “frame_num” is considered to be zero after the decoding of the picture.

The MMCO commands use PicNum and LongTermPicNum for indicating the target picture for the command as follows. (1) To mark a short-term picture as “unused for reference,” the PicNum difference between current picture p and the destination picture r is to be signaled in the MMCO command. (2) To mark a long-term picture as “unused for reference,” the LongTermPicNum of the to-be-removed picture r is to be signaled in the MMCO command. (3) To store the current picture p as a long-term picture, a “long_term_frame_idx” is to be signaled with the MMCO command. This index is assigned to the newly stored long-term picture as the value of LongTermPicNum. (4) To change a picture r from a short-term picture to a long-term picture, a PicNum difference between current picture p and picture r is signaled in the MMCO command and the “long_term_frame_idx” is signaled in the MMCO command. The index is also assigned to this long-term picture.

In addition to the above reference picture marking concepts from AVC, the marking in SVC is supported as follows. The marking of a base representation as “used for reference” is always the same as that of the corresponding decoded picture. There are therefore no additional syntax elements for marking base representations as “used for reference.” However, marking base representations as “unused for reference” makes use of separate MMCO commands, the syntax of which is not present in AVC, to enable optimal memory usage.

The hypothetical reference decoder (HRD), specified in Annex C of H.264/AVC, is used to check bitstream and decoder conformance. The HRD contains a coded picture buffer (CPB), an instantaneous decoding process, a decoded picture buffer (DPB), and an output picture cropping block. The CPB and the instantaneous decoding process are specified similarly to any other video coding standard, and the output picture cropping block simply crops those samples from the decoded picture that are outside the signaled output picture extents. The DPB was introduced in H.264/AVC in order to control the required memory resources for decoding of conformant bitstreams. The DPB includes a unified decoded picture buffering process for reference pictures and output reordering. A decoded picture is removed from the DPB when it is no longer used as reference and not needed for output. A picture is not needed for output when either one of the two following conditions is fulfilled: the picture was already output or the picture was marked as not intended for output with the “output_flag” that is present in the NAL unit header of SVC NAL units. The maximum size of the DPB that bitstreams are allowed to use is specified in the Level definitions (Annex A) of H.264/AVC.

There are two types of conformance for decoders: output timing conformance and output order conformance. For output timing conformance, a decoder must output pictures at identical times compared to the HRD. For output order conformance, only the correct order of output pictures is taken into account. The output order DPB is assumed to contain a maximum allowed number of frame buffers. A frame is removed from the DPB when it is no longer used as reference and no longer needed for output. When the DPB becomes full, the earliest frame in output order is output until at least one frame buffer becomes unoccupied.
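The output process of an output order DPB can be sketched as follows. The sketch is deliberately simplified: it assumes that a frame can be removed as soon as it has been output, i.e., reference marking is ignored, and the function name output_order is hypothetical.

```python
import heapq


def output_order(decoding_order_pocs, dpb_frames):
    """Return the order in which frames leave a DPB of dpb_frames buffers.

    Frames are identified by their POC. When the DPB is full, the earliest
    frame in output order is output to free a frame buffer (assumption: the
    frame is not held back by being used for reference).
    """
    if dpb_frames < 1:
        raise ValueError("the DPB needs at least one frame buffer")
    dpb, out = [], []
    for poc in decoding_order_pocs:
        while len(dpb) >= dpb_frames:
            out.append(heapq.heappop(dpb))  # smallest POC is output first
        heapq.heappush(dpb, poc)
    while dpb:  # flush the remaining frames at the end of the sequence
        out.append(heapq.heappop(dpb))
    return out


hierarchical = [0, 8, 4, 2, 1, 3, 6, 5, 7]  # decoding order of a GOP of size 8
print(output_order(hierarchical, dpb_frames=4))
```

With too few frame buffers the sketch emits frames out of order, which illustrates why the required DPB size depends on the coding structure.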

In multi-view video coding, video sequences output from different cameras, each corresponding to different views, are encoded into one bit-stream. After decoding, to display a certain view, the decoded pictures belonging to that view are reconstructed and displayed. It is also possible that more than one view is reconstructed and displayed.

Multi-view video coding has a wide variety of applications, including free-viewpoint video/television, 3D TV and surveillance.

A view component in MVC is referred to as a coded representation of a view in a single access unit. An anchor picture is a coded picture in which all slices may reference only slices within the same access unit, i.e., inter-view prediction may be used, but no inter prediction is used, and all following coded pictures in output order do not use inter prediction from any picture prior to the coded picture in decoding order. A base view in MVC is a view that has the minimum value of view order index in a coded video sequence. The base view can be decoded independently of other views and does not use inter-view prediction. The base view can be decoded by H.264/AVC decoders supporting only the single-view profiles, such as the Baseline Profile or the High Profile of H.264/AVC.

Referring now to FIG. 2, an exemplary MVC decoding order (i.e., bitstream order) is illustrated. The decoding order arrangement is referred to as time-first coding. Each access unit is defined to contain the view components of all the views for one output time instance. Note that the decoding order of access units may not be identical to the output or display order.

Referring now to FIG. 3, an exemplary MVC prediction (including both inter-picture prediction within each view and inter-view prediction) structure for multi-view video coding is illustrated. In the illustrated structure, predictions are indicated by arrows, the pointed-to object using the point-from object for prediction reference.

An anchor picture is a coded picture in which all slices reference only slices with the same temporal index, i.e., only slices in other views and not slices in earlier pictures of the current view. An anchor picture is signaled by setting the “anchor_pic_flag” to 1. After decoding the anchor picture, all following coded pictures in display order shall be able to be decoded without inter-prediction from any picture decoded prior to the anchor picture. If “anchor_pic_flag” is equal to 1 for a view component, then all view components in the same access unit also have “anchor_pic_flag” equal to 1. Consequently, decoding of any view can be started from a temporal index that corresponds to anchor pictures. Pictures with “anchor_pic_flag” equal to 0 are named non-anchor pictures.

In MVC, view dependencies are specified in the sequence parameter set (SPS) MVC extension. The dependencies for anchor pictures and non-anchor pictures are independently specified. Therefore anchor pictures and non-anchor pictures can have different view dependencies. However, for the set of pictures that refer to the same SPS, all the anchor pictures have the same view dependency, and all the non-anchor pictures have the same view dependency. In the SPS MVC extension, dependent views can be signaled separately for the views used as reference pictures in RefPicList0 and RefPicList1.

In MVC, there is an “inter_view_flag” in the network abstraction layer (NAL) unit header which indicates whether the current picture is not used or is allowed to be used for inter-view prediction for the pictures in other views.

In MVC, inter-view prediction is supported by texture prediction (i.e., the reconstructed sample values may be used for inter-view prediction), and only the decoded view components of the same output time instance (i.e., the same access unit) as the current view component are used for inter-view prediction. The fact that reconstructed sample values are used in inter-view prediction also implies that MVC utilizes multi-loop decoding. In other words, motion compensation and decoded view component reconstruction are performed for each view.

For the purpose of many decoding processes in MVC, a decoded picture is often used to mean a decoded view component. The process of constructing reference picture lists in MVC is summarized as follows.

First, a reference picture list is constructed including all the short-term and long-term reference pictures that are marked as “used for reference” and belong to the same view as the current slice. Those short-term and long-term reference pictures are named intra-view references for simplicity. Then, inter-view reference pictures and inter-view only reference pictures are appended after the intra-view references, according to the SPS and the “inter_view_flag,” to form an initial list. Reference picture list reordering (RPLR) is then performed when the slice header contains RPLR commands. The RPLR process may reorder the intra-view reference pictures, inter-view reference pictures and inter-view only reference pictures into a different order than the order in the initial list. Both the initial list and the final list after reordering must contain only a certain number of entries indicated by a syntax element in the slice header or the picture parameter set referred to by the slice.
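The formation of the initial list can be sketched as follows. The function name mvc_initial_list, the dictionary layout and the string placeholders for pictures are hypothetical; the sketch assumes that the intra-view references have already been ordered and that the dependent views are visited in the order in which they are signaled in the SPS MVC extension.

```python
def mvc_initial_list(intra_view_refs, inter_view_candidates, dependent_view_ids,
                     active_size):
    """Append inter-view references after the intra-view references.

    inter_view_candidates maps a view identifier to a pair
    (decoded view component of the same access unit, inter_view_flag).
    """
    ref_list = list(intra_view_refs)
    for view_id in dependent_view_ids:
        entry = inter_view_candidates.get(view_id)
        if entry is None:
            continue  # no view component of that view in this access unit
        picture, inter_view_flag = entry
        if inter_view_flag:  # only pictures allowed for inter-view prediction
            ref_list.append(picture)
    return ref_list[:active_size]  # keep only the signaled number of entries


candidates = {0: ("view0_pic", True), 1: ("view1_pic", False)}
print(mvc_initial_list(["t-1", "t-2"], candidates, [0, 1], active_size=4))
```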

Reference picture marking is performed identically to H.264/AVC for each view independently as if other views were not present in the bitstream.

The DPB operation is similar to that of H.264/AVC except for the following. Non-reference pictures (with “nal_ref_idc” equal to 0) that are used for inter-view prediction reference are called inter-view only reference pictures, and the term “inter-view reference pictures” refers only to those pictures with “nal_ref_idc” greater than 0 that are used for inter-view prediction reference. In some draft versions of MVC, inter-view only reference pictures are marked as “used for reference”, stored in the DPB, implicitly marked as “unused for reference” after decoding the access unit, and implicitly removed from the DPB when they are no longer needed for output and inter-view reference.

In MVC, the first byte of a NAL (Network Abstraction Layer) unit is followed by a NAL unit header extension (3 bytes). The NAL unit header extension includes the syntax elements that describe the properties of the NAL unit in the context of MVC.

Many display arrangements for multi-view video are based on rendering of a different image to the viewer's left and right eyes. For example, when data glasses or auto-stereoscopic displays are used, only two views are observed at a time in typical MVC applications, such as 3D TV, although the scene can often be viewed from different positions or angles. Based on the concept of asymmetric coding, one view in a stereoscopic pair can be coded with lower fidelity, while the perceptual quality degradation can be negligible. Thus, stereoscopic video applications may be feasible with moderately increased complexity and bandwidth requirement compared to mono-view applications, even in the mobile application domain.

As backward compatibility is important in practice, a so-called asymmetric stereoscopic video (ASV) codec can encode the base view (view 0) as H.264/AVC compliant and the other view (view 1) with techniques specified in H.264/AVC as well as inter-view prediction methods. Approaches have been proposed to realize an ASV codec by invoking a downsampling process before inter-view prediction.

However, it is desirable to design the coding of the low-resolution view in a manner with low computational complexity and high compression efficiency. A low complexity motion compensation (MC) scheme has been proposed to substantially reduce the complexity of asymmetric MVC without compression efficiency loss. Direct motion compensation without a downsampling process from the high resolution inter-view picture to the low resolution picture was proposed in Y. Chen, Y.-K. Wang, M. M. Hannuksela, and M. Gabbouj, “Single-loop decoding for multiview video coding,” in Proceedings of IEEE International Conference on Multimedia & Expo (ICME), June 2008. In direct motion compensation, the block of samples referred to by a motion vector pointing to an inter-view reference picture is sub-sampled to form a prediction block, i.e., only a subset of the sample values of the block in the inter-view reference picture is included in the prediction block. In another version of direct motion compensation, a filter is applied over several samples in the inter-view reference picture to obtain a sample in the prediction block. This version of direct motion compensation is described in Y. Chen, Y.-K. Wang, M. Gabbouj, and M. M. Hannuksela, “Regionally adaptive filtering for asymmetric stereoscopic video coding,” in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), May 2009.
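The two versions of direct motion compensation can be sketched as follows. The sketch assumes integer-sample motion vectors and a resolution ratio of 2 in both dimensions; the function names are hypothetical, and the plain averaging filter in the second function is only a stand-in for the adaptive filters of the cited publications.

```python
def direct_mc_subsample(ref_picture, x, y, block_w, block_h, factor=2):
    """Prediction block formed from every factor-th sample of the reference area.

    (x, y) is the top-left sample addressed by the motion vector in the
    high resolution inter-view reference picture.
    """
    return [
        [ref_picture[y + factor * j][x + factor * i] for i in range(block_w)]
        for j in range(block_h)
    ]


def direct_mc_filtered(ref_picture, x, y, block_w, block_h, factor=2):
    """Prediction block where each sample is the mean of a factor x factor area."""
    def mean(j, i):
        samples = [
            ref_picture[y + factor * j + dy][x + factor * i + dx]
            for dy in range(factor) for dx in range(factor)
        ]
        return sum(samples) / len(samples)
    return [[mean(j, i) for i in range(block_w)] for j in range(block_h)]


reference = [[10 * row + col for col in range(8)] for row in range(8)]
print(direct_mc_subsample(reference, 1, 2, 2, 2))
print(direct_mc_filtered(reference, 0, 0, 2, 2))
```

In both versions no downsampled copy of the inter-view reference picture has to be stored, which is the source of the complexity reduction described above.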

As noted above, compressed multi-view video sequences require a considerable bitrate. They may have been coded for a spatial resolution (picture size in terms of pixels) or picture quality (spatial details) that are unnecessary for a display in use or unfeasible for a computational or memory capacity in use while being suitable for another display and another computational complexity and memory resources in use. In many systems, it would therefore be desirable to adjust the transmitted or processed bitrate, the picture rate, the picture size, or the picture quality of a compressed multi-view video bitstream. The current multi-view video coding solutions offer scalability only in terms of view scalability (selecting which views are decoded) or temporal scalability (selecting the picture rate at which the sequence is decoded).

It is non-trivial to realize a multi-view video coding scheme where each view is coded with a scalable video codec and where inter-view prediction is enabled. Scalable adaptation of individual views may not be possible without causing a prediction drift in inter-view prediction, or multiple decoding loops within a view may be required and a lower compression efficiency may be achieved.

The following works proposed multiview video coding with spatial scalability, but inter-view prediction was used only between the decoded view components of the base layer of each view: N. Ozbek, A. M. Tekalp, and E. T. Tunali, “A New Scalable Multi-view Video Coding Configuration for Robust Selective Streaming of Free-Viewpoint TV,” Proc. of IEEE International Conference on Multimedia & Expo (ICME), pp. 1155-1158, 2007; and E. Kurutepe, M. R. Civanlar, and A. M. Tekalp, “Client-driven selective streaming of multiview video for interactive 3DTV,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 11, pp. 1558-1565, November 2007. The coding scheme presented in these works is disadvantageous when it comes to use of computational and memory resources, because when inter-view prediction is applied between the decoded pictures of the base layer, and, at the same time, inter prediction is allowed between the decoded pictures of the enhancement layer, two decoding loops are required per view. In other words, dependency representations of the base layer (having dependency_id equal to 0) for each view have to be entirely reconstructed in addition to decoding the dependency representations with dependency_id greater than 0. Furthermore, the lack of inter-view prediction for spatial enhancement layers has a negative impact on the coding efficiency.

When spatial scalability is applied and inter-view prediction is applied from a spatial enhancement layer of a view, removal of the spatial scalable enhancement layer would cause inter-view prediction from the view containing the spatial scalable enhancement layer to fail, as the reference picture used for inter-view prediction can be of a different spatial resolution or, in case of extended spatial scalability, cover a different region of the original uncompressed picture.

In another example, when coarse granular scalability (CGS) or medium grain scalability (MGS) is applied, removal of a CGS or MGS enhancement layer would cause a prediction drift in inter-view prediction from the view containing the CGS or MGS enhancement layer, as different decoded pictures would be used as prediction in the decoder and in the encoder. The decoder would use a decoded picture resulting from the bitstream where the CGS or MGS enhancement layer is not present, whereas the encoder used a decoded picture resulting from the bitstream where the CGS or MGS enhancement layer was present.

In accordance with embodiments of the present invention, a multi-view video coding scheme is provided where at least one view is coded with a scalable video coding scheme. In one particular embodiment, a multi-view video coding extension of the Scalable Video Coding (SVC) standard is provided. In another particular embodiment, a scalable video coding extension of the Multiview Video Coding (MVC) standard is provided.

Embodiments of the present invention provide a codec design that enables any view in a multi-view bitstream to be coded in a scalable fashion so that scalable layers can be pruned unevenly between views. In one embodiment, the inter-view prediction from the base view is adapted on the basis of which scalable layers are present in the non-base view. A reference picture marking design and a reference picture list construction design are provided to enable the use of any dependency representation from any other view earlier in view order than the current view for inter-view prediction.

For the dependency representation used for inter-view prediction, the reference picture marking design and reference picture list construction design in accordance with embodiments of the present invention allow for selective use of the base representation or the enhancement representation of the dependency representation for inter-view prediction. The enhancement representation of a dependency representation may result from decoding of an MGS layer representation or an FGS layer representation.

In FIG. 8, an example of a scalable stereoscopic coding scheme enabling bitstream pruning to a mixed-resolution stereoscopic video is presented. The base view 810 is coded in a non-scalable manner with H.264/AVC. The non-base view 820 is coded in a spatially scalable manner with SVC including inter-view prediction. A decoded picture of the base view can be used as inter-view prediction reference for a dependency representation of the non-base view having the same spatial resolution. This is illustrated with arrows 816 in FIG. 8. Inter-view prediction can be allowed for a dependency representation having any temporal_id, not just for dependency representations having temporal_id equal to 0 as illustrated in FIG. 8. In FIGS. 8 and 9, the size of the squares 814, 824, 826 inside the view components 812, 822 (squares with dotted lines) illustrates the relative sample count enclosed by the dependency representation. A smaller square 826 illustrates a smaller sample count than a larger square 814.

In the non-base view 820 of the example of FIG. 8 the smaller squares 826 illustrate dependency representations having dependency_id equal to 0 and the larger squares 824 illustrate dependency representations having dependency_id equal to 1. In practical situations there may also be other dependency representations having a different (a higher) dependency_id than 0 or 1.

The coded non-base view can be manipulated to achieve a mixed-resolution bitstream by excluding dependency representations having the highest dependency_id value; in this example the highest value of dependency_id is equal to 1. In some embodiments, a mixed-resolution bitstream can be achieved by excluding more than one dependency representation per access unit, with the constraint that the excluded dependency representations have higher dependency_id values than those dependency representations remaining in the same view. After removal of one or more dependency representations per access unit, inter-view prediction references may be of different spatial resolution compared to the view components of the non-base view being encoded/decoded, and hence the inter-view prediction process may be adapted. Basically, the decoded base-view pictures are downsampled prior to using them as inter-view prediction references. Alternatively, direct inter-view prediction as described in the following publications may be applied: Y. Chen, Y.-K. Wang, M. Gabbouj, and M. M. Hannuksela, “Regionally adaptive filtering for asymmetric stereoscopic video coding,” Proc. of IEEE International Symposium on Circuits and Systems (ISCAS), May 2009; Y. Chen, Y.-K. Wang, M. M. Hannuksela, and M. Gabbouj, “Picture-level adaptive filter for asymmetric stereoscopic video,” Proc. of IEEE International Conference on Image Processing (ICIP), October 2008; or Y. Chen, S. Liu, Y.-K. Wang, M. M. Hannuksela, H. Li, and M. Gabbouj, “Low-complexity asymmetric multiview video coding,” Proc. of IEEE International Conference on Multimedia & Expo (ICME), June 2008.

An example of the inter-view prediction process with downsampling or direct inter-view prediction is illustrated in FIG. 9. The dependency representations of the non-base view components 822 may be predicted using inter-view prediction from the base view 810 and inter prediction within the non-base view 820. Because the decoded view components 822 of the non-base view in FIG. 9 have a different spatial resolution than the decoded view components of the base view, the inter-view reference pictures 814 are down- or upsampled before or during the inter-view prediction. This is illustrated in FIG. 9 as dotted arrows 816 from some view components 812 of the base view to view components 822 of the non-base view within the same access unit.

In some embodiments the encoder may operate as follows. The encoder receives 1002 (FIG. 10) two or more video signals (views) and encodes 1004 them to obtain different scalability layers. One of the video signals may represent a base view and the other video signal(s) represent non-base view(s). The base view may be encoded in a non-scalable manner and a non-base view may be encoded to obtain different scalability layers (dependency representations). The non-base view may contain dependency representations having dependency_id equal to 0 or 1. The encoder reconstructs 1008 decoded pictures having dependency_id equal to 0 and dependency_id equal to 1. For inter-view prediction of the dependency representations having a different spatial resolution 1010 than the decoded pictures of the base view, the reference pictures of the base view are resampled 1012. The resampling may be performed e.g. by filtering the reference pictures, by selecting a smaller set of samples from the reference pictures, or by using another applicable method to obtain smaller resolution pictures. Resampled reference pictures may be stored into a reference picture memory of the encoder, for example.

A resampled reference picture can be removed from the reference picture memory when it is no longer needed for inter-view reference. In some embodiments, the inter-view motion vectors may be constrained and resampling can be done in a sliding window manner, e.g. one resampled macroblock row can be added into the bottom of the sliding window when the top-most macroblock row of the sliding window is removed. In some implementations, resampling may be done in-place, i.e., only for the inter-view prediction block.

When the encoder has changed the resolution of one or more of the views, the encoder may include one or more indications 1014 into the bitstream facilitating the detection of one or more of the following.

A change in the maximum dependency_id value at the present view component requires resampling of inter-view reference pictures only, or no resampling at all if the decoding of the view component having the new maximum dependency_id value results in the same spatial resolution as that of the inter-view reference pictures. This is equivalent to an IDR picture in single-view H.264/AVC coding. When considering the example of FIG. 8, a corresponding indication, such as view_resolution_change_property equal to 0, may be associated with the non-base view component of the first anchor picture.

A change in the maximum dependency_id value at the present view component may require resampling of inter-view reference pictures. In addition, resampling of reference pictures for inter prediction of dependency representations preceding the current dependency representation in output order and following the current dependency representation in decoding order may be required. This is equivalent to an open GOP intra picture in single-view coding. Bit-exact decoding of so-called leading pictures (or leading dependency representations) might not be possible. When considering the example of FIG. 8, a corresponding indication, such as view_resolution_change_property equal to 1, may be associated with the non-base view component of the second anchor picture.

A change in the maximum dependency_id value at the present view component may require resampling of inter and inter-view reference pictures. Bit-exact decoding of dependency representations preceding the next dependency representation causing a decoding refresh might not be possible. When considering the example of FIG. 8, a corresponding indication, such as view_resolution_change_property equal to 2, may be associated with the non-base view component of the second access unit having temporal_id equal to 0.

The above-mentioned one or more indications may be included in one or more various syntax structures, such as a NAL unit header, a prefix NAL unit, a payload content scalability information (PACSI) NAL unit, a supplemental enhancement information message, a slice header, a picture header, a picture parameter set, and a sequence parameter set (where the indications may be associated with view components having certain temporal_id values). In addition or alternatively, the above-mentioned one or more indications may be included in metadata in a file encapsulating the video bitstream or in a header field of a packet encapsulating at least a part of the video bitstream, such as a Real-time Transport Protocol (RTP) payload header or an RTP packet header.

In a bitstream conversion operation, some of the view components are modified by pruning dependency representations. The conversion may happen in the sender 130, the gateway 140, the receiver 150, or the decoder 160. The sender 130 may send the bitstream to the gateway 140, which may forward the bitstream to the receiver 150, which may provide the bitstream to the decoder 160 for decoding and possibly for presenting the decoded presentation to a viewer. If the sender 130 decides to convert the bitstream, the sender 130 prunes one or more dependency representations from the bitstream before sending it to the gateway 140 and may provide in the bitstream an indication of the pruning. Correspondingly, if the gateway 140 decides to convert the bitstream, the gateway 140 prunes one or more dependency representations from the bitstream before sending it to the receiver 150 and may provide in the bitstream an indication of the pruning. If the receiver 150 decides to convert the bitstream, the receiver 150 prunes one or more dependency representations from the bitstream before providing the bitstream to the decoder 160 and may also provide to the decoder 160 an indication of the pruning.

The decision to convert may be made on the basis of e.g. one or more of the following situations. A downlink throughput is estimated or reported, e.g. by the gateway 140 or by the sender 130, to be lower than the bitrate of the bitstream. Hence, bitrate adaptation of the bitstream is needed to reduce the bitrate of the bitstream. The computational or memory capacity of the decoder may not be sufficient for the decoding of the entire bitstream. Hence, the decoder 160 may inform the receiver 150 to adapt the bitrate. It may also happen that the viewer of the video representation is detected or estimated to be so far from the display that the perceptual quality of mixed-resolution stereoscopic video is approximately equal to that of full-resolution stereoscopic video, in which case the bitrate may be adapted to a lower level. Also, if there are more data streams transmitted through the network and/or processed by the decoding device, and the perceptual quality decrease caused by mixed-resolution stereoscopic or multiview video is estimated to be less annoying than bitrate and/or complexity adaptation of the other data streams, the resolution of one or more views of the stereoscopic or multiview video may be decreased.

The converter 180, which may be located in or attached to the sender 130, the gateway 140, the receiver 150, and/or the decoder 160, may read one or more indications from the bitstream, or from packet headers or the like associated with the bitstream, facilitating the detection of which decoded reference pictures may have to be resampled and whether a drift in sample values of the decoded pictures may be possible. The converter may decide the access unit or view component on which a change in the maximum dependency_id value is made based on its knowledge of how the decoder supports reference picture resampling (resampling for inter-view reference pictures only or resampling of inter-view and inter reference pictures). Alternatively or in addition, the converter may decide the access unit or view component on which a change in the maximum dependency_id value is made based on the existence and potential duration of drift in sample values (no drift, drift only in leading pictures, drift until the next refresh dependency representation).

The converter may prune NAL units of the selected view on the basis of their dependency_id value in accordance with the sub-bitstream extraction process of clause G.8.8.1 of H.264/AVC.
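The pruning step can be sketched as a simple filter over NAL units. The following is a minimal illustration only, not the normative sub-bitstream extraction process: the NalUnit record and the prune_view function are hypothetical names, and a NAL unit is reduced to the two header fields the decision depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnit:
    """Hypothetical, simplified NAL unit: only the fields used for pruning."""
    view_id: int
    dependency_id: int
    payload: bytes = b""


def prune_view(nal_units, view_id, max_dependency_id):
    """Drop NAL units of the selected view above max_dependency_id.

    NAL units of all other views are passed through unchanged, so the
    result may be a mixed-resolution bitstream.
    """
    return [
        nal for nal in nal_units
        if nal.view_id != view_id or nal.dependency_id <= max_dependency_id
    ]


# One access unit of the FIG. 8 style example: a non-scalable base view (0)
# and a non-base view (1) with dependency_id 0 and 1.
access_unit = [NalUnit(0, 0), NalUnit(1, 0), NalUnit(1, 1)]
mixed_resolution = prune_view(access_unit, view_id=1, max_dependency_id=0)
```

After such pruning the remaining dependency representation of the non-base view has a lower resolution than the base view, so the inter-view reference pictures would have to be resampled as described above.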

The decoder may detect if an inter-view reference picture has a different spatial resolution than the non-base view component being decoded. That being the case, the decoder resamples the inter-view reference picture to the same spatial resolution as the non-base view component being decoded. Then, the decoder decodes the non-base view component using the resampled inter-view reference picture for inter-view prediction. As mentioned above, the resampling may also be done in a sliding window manner, e.g. one resampled macroblock row at a time, or in-place, i.e., only for one inter-view prediction block at a time.

In some embodiments the decoder may operate as follows. The decoder receives 1102 (FIG. 11) an encoded bitstream containing view components of two or more video signals and decodes the bitstream to reconstruct the original view components. The bitstream may contain data units (e.g. NAL units) in which the encoded view components have been transmitted. The decoder (or the receiver 150) buffers the view components and rearranges them into decoding order if the decoding order is different from the transmission order (block 1104). The decoder may also examine, e.g. by using a reference picture list, whether a view component is used as a reference. If so, the decoder may mark 1106 the reference view components as “used for reference”. The decoder may also examine 1108 whether the resolution of the inter-view reference view components differs from the resolution of the view components to be predicted on the basis of the inter-view reference view components. If it does, the inter-view reference view components are resampled 1110 to the resolution of the view components to be predicted.

The decoder decodes 1112 the view components using reference view components in the decoding when the view components are predicted view components.

If the spatial resolution of one or more views changes 1116, the corresponding inter-view reference view components may be resampled, if necessary.

The decoded view components can be provided 1114 to a renderer for displaying, to a memory for storing, etc.

The above process may be repeated 1116 until the whole bitstream has been received and decoded.

In one embodiment, the resolution of the base view and the resolution of the base layer of the non-base view are the same. The non-base view has an enhancement layer increasing the resolution compared to that of the base layer. For the encoding/decoding of the enhancement layer, the inter-view reference pictures are (implicitly) upsampled. Such an embodiment can be used to provide a possibility for mixed-resolution improvement to a symmetric bitstream, such as a standard-compliant MVC bitstream.

In one embodiment, more than two views are coded where one or more views are coded in a spatially scalable manner. Resampling of inter-view reference pictures is applied whenever a view is coded/decoded at a different resolution than its reference view. A pruning order indicator may be used to indicate the intended order of pruning spatial layers from the multiview bitstream. An encoder or a bitstream analyzer may create the values of the pruning order indicator based on the reference pictures it has used for inter-view prediction. Encoders may select inter, inter-layer and inter-view prediction references in such a way that any bitstream extraction performed according to the pruning order indicator results in a valid bitstream. A pruning order indication may be included in the bitstream, in metadata included in a file encapsulating the bitstream, or in a packet header or the like encapsulating at least a part of the bitstream. The pruning order indicator can be realized with a “priority_id” syntax element included in each NAL unit header. A bitstream subset containing all the NAL units having pruning order values less than or equal to any chosen value is a valid bitstream. For example, a bitstream may contain three views, where view 2 depends on view 1 and view 1 depends on view 0 (the base view), and at least views 1 and 2 have spatial scalability layers with equal resolution across the views. Then, the pruning order indicator may indicate that the spatial enhancement layer of view 2 is to be pruned before the spatial enhancement layer of view 1. Consequently, the base layer of view 2 is inter-view predicted from the downsampled decoded view components of view 1 (decoded using both its base and enhancement layers).
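The three-view example can be illustrated with a short sketch of extraction by pruning order. The Nal record, the extract_by_priority function and the concrete priority_id values are assumptions chosen for illustration; only the rule that a subset with priority_id less than or equal to a chosen value forms a valid bitstream comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nal:
    """Hypothetical, simplified NAL unit header fields."""
    view_id: int
    dependency_id: int
    priority_id: int


def extract_by_priority(nal_units, max_priority_id):
    """Keep the NAL units whose pruning order value does not exceed the limit."""
    return [nal for nal in nal_units if nal.priority_id <= max_priority_id]


# Illustrative assignment: all base layers get priority_id 0, the enhancement
# layer of view 1 gets 1 and the enhancement layer of view 2 gets 2, so the
# enhancement layer of view 2 is the first one to be pruned.
access_unit = [
    Nal(view_id=0, dependency_id=0, priority_id=0),
    Nal(view_id=1, dependency_id=0, priority_id=0),
    Nal(view_id=1, dependency_id=1, priority_id=1),
    Nal(view_id=2, dependency_id=0, priority_id=0),
    Nal(view_id=2, dependency_id=1, priority_id=2),
]

# View 2 loses its enhancement layer; view 1 keeps both layers.
pruned_once = extract_by_priority(access_unit, max_priority_id=1)
```

With max_priority_id equal to 1, view 2 is decoded at base-layer resolution while view 1 is still decoded at full resolution, which is the situation in which the decoded view components of view 1 are downsampled for inter-view prediction.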

In some embodiments the base view may also be scalably coded and the spatial resolution of the base view may be changed. It may also be possible that the non-base view is coded in a non-scalable manner and the base view is coded in a spatially scalable manner.

If spatially scalable view components of a reference view are used as inter-view prediction references for view components of a second view, it may become ambiguous whether a resampled decoded view component resulting from decoding the highest dependency representation or the decoded view component resulting from decoding the dependency representation having the same spatial resolution as the view component in the second view should be decoded. If the encoder has used the resampled decoded view component resulting from decoding the highest dependency representation but the highest dependency representation has been subsequently removed by a converter or the like, the decoder typically uses the decoded view component resulting from decoding the dependency representation having the same spatial resolution as the view component in the second view as the inter-view prediction reference. If the encoder has used the decoded view component resulting from decoding the dependency representation having the same spatial resolution as the view component in the second view and no dependency representation from the reference view has been subsequently removed by a converter or the like, the decoder should reconstruct both the resampled decoded view component resulting from decoding the highest dependency representation and the decoded view component resulting from decoding the dependency representation having the same spatial resolution as the view component in the second view. Hence, multiple decoded pictures per view per access unit are required to be decoded.

In order to control the decoding of the required inter-view prediction reference pictures when spatially scalable view components of a reference view are used as inter-view prediction references for view components of a second view, the encoder may operate as follows. The encoder may set one or more indications, such as an inter_view_ubp_flag equal to 1, for those access units or view components for which it uses the lowest dependency representation for inter-view reference of a view component in a second view. Two decoded view components for the reference view component are typically reconstructed by the encoder and the decoder when inter_view_ubp_flag is equal to 1: one (the so-called inter-view reference picture) from the dependency representation with dependency_id equal to 0 and another one from all dependency representations that are present. As the lowest dependency representation is always present regardless of potential pruning operations, the potential mismatch between the encoder and decoder reconstructions is stopped when inter_view_ubp_flag is equal to 1. The encoder may therefore adaptively select the interval of view components in the reference view for which inter_view_ubp_flag is equal to 1. The higher the frequency of view components with inter_view_ubp_flag equal to 1, the shorter the potential mismatch periods are, but also the higher the computational complexity of decoding is. In some embodiments, leaky inter-view prediction is used, where multi-hypothesis prediction, such as bi-prediction, is used and the weight of the prediction blocks from the inter-view reference base pictures is adaptively selected.

An exemplary codec in accordance with embodiments of the present invention is described below. All the processes that are specified in SVC and MVC apply as such or are modified in the description of the exemplary codec below.

NAL Unit Header

The NAL unit syntax (i.e., nal_unit( )) is as specified in MVC. The syntax and semantics of the proposed NAL unit header are as follows. The first byte of the NAL unit header consists of forbidden_zero_bit (1 bit), nal_ref_idc (2 bits), and nal_unit_type (5 bits), same as in H.264/AVC, SVC, and MVC. The rest of the bytes of the NAL unit header are contained in the syntax structure nal_unit_header_svc_mvc_extension( ) defined as follows:

nal_unit_header_svc_mvc_extension( ) {      C      Descriptor
    svc_mvc_extension_flag                  All    u(1)
    idr_flag                                All    u(1)
    priority_id                             All    u(6)
    no_inter_layer_pred_flag                All    u(1)
    dependency_id                           All    u(3)
    quality_id                              All    u(4)
    temporal_id                             All    u(3)
    use_ref_base_pic_flag                   All    u(1)
    discardable_flag                        All    u(1)
    output_flag                             All    u(1)
    reserved_three_2bits                    All    u(2)
    anchor_pic_flag                         All    u(1)
    view_id                                 All    u(10)
    inter_view_flag                         All    u(1)
    inter_view_ubp_flag                     All    u(1)
    reserved_seven_3bits                    All    u(3)
}
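The field widths in the syntax structure above add up to 40 bits, i.e. five bytes following the first NAL unit header byte. As a rough illustration of how such a header extension could be read, the sketch below unpacks the fields in the listed order, most significant bit first. The function names parse_header_extension and pack_header_extension are hypothetical helpers, not part of any codec library, and emulation prevention is not considered.

```python
# Field names and bit widths in bitstream order, taken from the syntax
# structure nal_unit_header_svc_mvc_extension( ).
FIELDS = [
    ("svc_mvc_extension_flag", 1),
    ("idr_flag", 1),
    ("priority_id", 6),
    ("no_inter_layer_pred_flag", 1),
    ("dependency_id", 3),
    ("quality_id", 4),
    ("temporal_id", 3),
    ("use_ref_base_pic_flag", 1),
    ("discardable_flag", 1),
    ("output_flag", 1),
    ("reserved_three_2bits", 2),
    ("anchor_pic_flag", 1),
    ("view_id", 10),
    ("inter_view_flag", 1),
    ("inter_view_ubp_flag", 1),
    ("reserved_seven_3bits", 3),
]
TOTAL_BITS = sum(width for _, width in FIELDS)  # 40 bits, i.e. 5 bytes


def parse_header_extension(data: bytes) -> dict:
    """Unpack the header extension from the bytes after the first header byte."""
    num_bytes = TOTAL_BITS // 8
    if len(data) < num_bytes:
        raise ValueError("header extension needs %d bytes" % num_bytes)
    value = int.from_bytes(data[:num_bytes], "big")
    fields = {}
    shift = TOTAL_BITS
    for name, width in FIELDS:
        shift -= width
        fields[name] = (value >> shift) & ((1 << width) - 1)
    return fields


def pack_header_extension(fields: dict) -> bytes:
    """Inverse of parse_header_extension; missing fields default to 0."""
    value = 0
    for name, width in FIELDS:
        field = fields.get(name, 0)
        if not 0 <= field < (1 << width):
            raise ValueError("%s does not fit in %d bits" % (name, width))
        value = (value << width) | field
    return value.to_bytes(TOTAL_BITS // 8, "big")


example = {"dependency_id": 1, "view_id": 5, "inter_view_flag": 1,
           "inter_view_ubp_flag": 1, "reserved_three_2bits": 3,
           "reserved_seven_3bits": 7}
header_bytes = pack_header_extension(example)
parsed = parse_header_extension(header_bytes)
```

Packing and then parsing returns the same field values, which gives a quick consistency check of the field order and widths.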

The semantics of forbidden_zero_bit, nal_ref_idc and nal_unit_type are as specified in SVC, with the following additions.

NAL units with “nal_unit_type” equal to 1 to 5 are only used for the base view as specified in MVC. Within the base view, the use of NAL units with “nal_unit_type” equal to 1 to 5 is as specified in SVC. Prefix NAL units shall only appear in the base layer in the base view. The base layer in the base view is as specified in SVC. For non-base views, coded slice NAL units with “dependency_id” equal to 0 and “quality_id” equal to 0 have “nal_unit_type” equal to 20, and prefix NAL units are not present.

When the current NAL unit is a prefix NAL unit with “nal_unit_type” equal to a value reserved for the scalable multiview coding slices, all the syntax elements in “nal_unit_header_svc_mvc_extension( )” also apply to the NAL unit that directly succeeds the prefix NAL unit in decoding order. A NAL unit that directly succeeds a prefix NAL unit is considered to contain these syntax elements with values identical to those of the prefix NAL unit.

“svc_mvc_extension_flag” is reserved for future extensions and is set to 0.

“idr_flag” equal to 1 specifies that the current access unit is an IDR access unit when all the view components in the current access unit are IDR view components. A view component consists of all the NAL units in one access unit having identical “view_id.” An IDR view component refers to a view component for which the dependency representation with the greatest value of “dependency_id” among all the dependency representations within the view component has “idr_flag” equal to 1 or “nal_unit_type” equal to 5.

The semantics of “priority_id” are as specified in MVC.

The semantics of “no_inter_layer_pred_flag” are as specified in SVC.

“dependency_id” specifies a dependency identifier for the NAL unit. “dependency_id” is equal to 0 in VCL prefix NAL units. NAL units having the same value of “dependency_id” within one view comprise a dependency representation. Within a bitstream, a dependency representation is identified by a pair of “view_id” and “dependency_id” values.

The semantics of “quality_id” are as specified in SVC, with the following addition. NAL units having the same value of “quality_id” within one dependency representation comprise a layer representation. Within a bitstream, a layer representation is identified by a set of “view_id,” “dependency_id” and “quality_id” values.

The semantics of temporal_id are as specified in MVC.

“use_ref_base_pic_flag” equal to 1 specifies that reference base pictures (also referred to as base representations) are used as reference pictures for the inter prediction process. “use_ref_base_pic_flag” equal to 0 specifies that decoded pictures (also referred to as enhancement representations) are used as reference pictures during the inter prediction process. The value of “use_ref_base_pic_flag” is the same for all NAL units of a dependency representation. “use_ref_base_pic_flag” is equal to 0 in filler prefix NAL units.

“discardable_flag” equal to 1 specifies that the current NAL unit is not used for decoding NAL units of the current view component and all subsequent view components of the same view that have a greater value of “dependency_id” than the current NAL unit. “discardable_flag” equal to 0 specifies that the current NAL unit may be used for decoding NAL units of the current view component and all subsequent view components of the same view that have a greater value of “dependency_id” than the current NAL unit. “discardable_flag” is equal to 1 in filler prefix NAL units.

The semantics of “output_flag” and “reserved_three_2bits” are as specified in SVC.

“anchor_pic_flag” equal to 1 specifies that the current view component is an anchor picture as specified in MVC when the value of “dependency_id” for the NAL unit is equal to the maximum value of “dependency_id” for the view component. “anchor_pic_flag” is identical for all NAL units within a dependency representation.

The semantics of “view_id” are as specified in MVC.

“inter_view_flag” equal to 0 specifies that the current dependency representation is not used for inter-view prediction. “inter_view_flag” equal to 1 specifies that the current dependency representation is used for inter-view prediction.

“inter_view_ubp_flag” equal to 1 specifies that the current dependency representation uses base representations for inter-view prediction. A base representation for inter-view prediction is decoded from a view component with dependency_id equal to 0 and quality_id equal to 0 in the reference view in the same access unit as the current dependency representation. If the base representation for inter-view prediction is of different spatial resolution from the spatial resolution of the current dependency representation, the base representation for inter-view prediction is re-sampled to the same resolution as the current dependency representation. “inter_view_ubp_flag” equal to 0 specifies that the current dependency representation does not use base representations for inter-view prediction.

The values of “inter_view_ubp_flag” are the same for all NAL units of a dependency representation.

“reserved_seven_3bits” shall be equal to 7. Decoders shall ignore the value of “reserved_seven_3bits.”

Prefix NAL Unit

The prefix NAL unit RBSP syntax, “prefix_nal_unit_rbsp( ),” and the semantics of the fields therein are as specified in SVC.

Reference Picture Marking

Reference picture marking as specified in SVC applies independently for each view. Note that inter-view only reference pictures (with “nal_ref_idc” equal to 0 and “inter_view_flag” equal to 1) are not marked by the reference picture marking process.

Reference Picture List Construction

In one embodiment, the reference picture list construction process is described as follows. A variable biPred is derived as follows:

If the current slice currSlice is a B or EB slice, biPred is set equal to 1;

Otherwise, biPred is set equal to 0.

A reference picture list initialization process is invoked as specified in subclause G.8.2.3 of SVC (excluding the reordering process for reference picture lists). After that, an appending process for inter-view reference pictures and inter-view only reference pictures as specified in subclause H.8.2.1 of MVC is invoked with the following modification. During the invocation of the appending process, if the current slice has “inter_view_ubp_flag” equal to 1, then only base representations for inter-view prediction are considered; otherwise the decoded pictures (i.e. enhancement representations) are considered for inter-view prediction.

The initial reference picture lists RefPicList0 and, when biPred is equal to 1, RefPicList1 are modified by invoking the reordering process for reference picture lists as specified in subclause H.8.2.2.2 of MVC. During the reordering process in subclause H.8.2.2.2, if a view component that does not belong to the current view is targeted for reordering and “inter_view_ubp_flag” is equal to 1 for the current slice, the decoded base picture for inter-view prediction of that view component is used; otherwise, the decoded picture (i.e. the enhancement representation) of that view component is used.
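The selection between base and enhancement representations in the first embodiment can be summarized with a small sketch. It only models the choice driven by “inter_view_ubp_flag”; the initialization and reordering processes of SVC and MVC are not reproduced, and the InterViewRef record and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterViewRef:
    """Hypothetical holder for the two reconstructions of one reference view component."""
    base: str         # base representation for inter-view prediction
    enhancement: str  # decoded picture (enhancement representation)


def select_inter_view_picture(ref, inter_view_ubp_flag):
    """Pick the representation used for inter-view prediction of the current slice."""
    return ref.base if inter_view_ubp_flag == 1 else ref.enhancement


def append_inter_view_refs(initial_list, inter_view_refs, inter_view_ubp_flag):
    """Append inter-view references to an already initialized reference list."""
    appended = [select_inter_view_picture(ref, inter_view_ubp_flag)
                for ref in inter_view_refs]
    return list(initial_list) + appended


temporal_refs = ["view1_prev_picture"]
view0 = InterViewRef(base="view0_base", enhancement="view0_decoded")
list_with_base = append_inter_view_refs(temporal_refs, [view0], inter_view_ubp_flag=1)
list_with_enh = append_inter_view_refs(temporal_refs, [view0], inter_view_ubp_flag=0)
```

Because the flag applies to the whole slice, every inter-view entry in a given list uses the same kind of representation in this embodiment.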

In accordance with a second embodiment, the reference picture list construction process is described as follows. Note that this embodiment can use the base representation of one inter-view reference picture or inter-view only reference picture and the enhancement representation of another inter-view reference picture or inter-view only reference picture for coding of one slice. Extra syntax elements are added in the reference picture list reordering syntax table.

“use_inter_view_base_flag” equal to 1 indicates that for the current view component being reordered, its base representation is to be added into the reference picture list. The value equal to 0 indicates that its enhancement representation is to be added into the reference picture list. The values of “use_inter_view_base_flag” may be such that all occurrences of the same inter-view reference picture or inter-view only reference picture in the final reference picture list are either all base representations or all enhancement representations.

ref_pic_list_reordering( ) {                                  C    Descriptor
    if( slice_type != I && slice_type != ) {
        ref_pic_list_reordering_flag_l0                       2    u(1)
        if( ref_pic_list_reordering_flag_l0 )
            do {
                reordering_of_pic_nums_idc                    2    ue(v)
                if( reordering_of_pic_nums_idc = = 0 | |
                    reordering_of_pic_nums_idc = = 1 )
                    abs_diff_pic_num_minus1                   2    ue(v)
                else if( reordering_of_pic_nums_idc = = 2 )
                    long_term_pic_num                         2    ue(v)
                else if( reordering_of_pic_nums_idc = = 4 | |
                         reordering_of_pic_nums_idc = = 5 )
                    abs_diff_view_idx_minus1                  2    ue(v)
                    use_inter_view_base_flag                  2    u(1)
            } while( reordering_of_pic_nums_idc != 3 )
    }
    if( slice_type = = B | | slice_type = = EB ) {
        ref_pic_list_reordering_flag_l1                       2    u(1)
        if( ref_pic_list_reordering_flag_l1 )
            do {
                reordering_of_pic_nums_idc                    2    ue(v)
                if( reordering_of_pic_nums_idc = = 0 | |
                    reordering_of_pic_nums_idc = = 1 )
                    abs_diff_pic_num_minus1                   2    ue(v)
                else if( reordering_of_pic_nums_idc = = 2 )
                    long_term_pic_num                         2    ue(v)
                else if( reordering_of_pic_nums_idc = = 4 | |
                         reordering_of_pic_nums_idc = = 5 )
                    abs_diff_view_idx_minus1                  2    ue(v)
                    use_inter_view_base_flag                  2    u(1)
            } while( reordering_of_pic_nums_idc != 3 )
    }
}

The reference picture list construction processes are specified as follows. A reference picture list initialization process is invoked as specified in subclause G.8.2.3 of SVC (excluding the reordering process for reference picture lists). After that, an appending process for inter-view reference pictures and inter-view only reference pictures as specified in subclause H.8.2.1 of MVC is invoked with the following modification. During the invocation of the appending process, only the decoded pictures (i.e. enhancement representations) are considered.

The initial reference picture lists RefPicList0 and RefPicList1 (when biPred is equal to 1) are modified by invoking the reordering process for reference picture lists as specified in subclause H.8.2.2.2 of MVC. During the reordering process in subclause H.8.2.2.2, if a view component that does not belong to the current view is targeted for reordering and “use_inter_view_base_flag” is equal to 1, the base representation of that view component is used; otherwise (when the flag is equal to 0), the enhancement representation of that view component is used.

Decoding Process

For any view component, the dependency representation with the highest value for “dependency_id” is decoded. If inter_view_ubp_flag is equal to 1, the base representation for inter-view prediction is additionally reconstructed for view components used as inter-view reference. If a decoded view component or a base representation for inter-view prediction is used for inter-view prediction and has a different spatial resolution than the view component being decoded, the decoded view component or the base representation for inter-view prediction (whichever is referred to in inter-view prediction) is re-sampled to the same spatial resolution as the view component being decoded. If re-sampling is a down-sampling operation, for example a filter with taps {2, 0, −4, −3, 5, 19, 26, 19, 5, −3, −4, 0, 2}/64 may be used. If re-sampling is an up-sampling operation, the SVC up-sampling filter may be used. As mentioned above, the resampling may also be done in a sliding window manner, e.g. one resampled macroblock row at a time, or in-place, i.e., only for one inter-view prediction block at a time. Direct motion compensation or sub-sampling may also be used for re-sampling. Otherwise, the SVC decoding process is used with the modifications specified above.
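As an illustration of the down-sampling case, the sketch below applies the 13-tap filter given above to one row of samples. Only the tap values and the division by 64 come from the description; the decimation factor of 2, the edge replication at picture borders, the rounding offset, the sampling phase and the 8-bit clipping are assumptions made for this example, and downsample_row is a hypothetical helper name.

```python
# Filter taps from the description; they sum to 64, so flat areas are preserved.
TAPS = (2, 0, -4, -3, 5, 19, 26, 19, 5, -3, -4, 0, 2)


def downsample_row(row, factor=2, max_value=255):
    """Low-pass filter a row of samples and keep every factor-th output sample.

    Assumptions (not from the description): samples beyond the borders are
    replaced by the nearest border sample, results are rounded to nearest
    and clipped to the range 0..max_value.
    """
    half = len(TAPS) // 2
    last = len(row) - 1
    out = []
    for center in range(0, len(row), factor):
        acc = 0
        for k, tap in enumerate(TAPS):
            idx = min(max(center + k - half, 0), last)  # edge replication
            acc += tap * row[idx]
        sample = (acc + 32) >> 6  # divide by 64 with rounding
        out.append(min(max(sample, 0), max_value))
    return out


flat_row = [100] * 16
half_width_row = downsample_row(flat_row)
```

A two-dimensional picture could be handled by running the same routine over rows and then over columns, and the row-wise form also fits the sliding window operation mentioned above, since each output row depends only on a bounded set of input rows.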

Leaky Inter-View Prediction

One aspect of the invention allows so-called leaky inter-view prediction. In other words, a prediction block can be formed by a weighted average of a base representation and an enhancement representation of an inter-view reference picture or inter-view only reference picture. This feature can be used to control the potential drift propagation caused by inter-view prediction from quality-scalable (either MGS or FGS) views.

Leaky inter-view prediction can be realized in a similar way as described above, except that both the base representation and the enhancement representation of one inter-view reference picture or inter-view only reference picture are allowed in a reference picture list. Weighted bi-prediction is used to control the averaging between a base representation and an enhancement representation. In this case, only the semantics of “use_inter_view_base_flag” are changed such that the constraint in the semantics does not apply. That is, the values of “use_inter_view_base_flag” need not be such that all occurrences of the same inter-view reference picture or inter-view only reference picture in the final reference picture list are either base representations or enhancement representations. In other words, a final reference picture list can include both a base representation and an enhancement representation of the same inter-view reference picture or inter-view only reference picture.
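The weighted average behind leaky inter-view prediction can be sketched as follows. The integer weighting with a power-of-two denominator is an assumption modeled loosely on explicit weighted bi-prediction; the function name leaky_prediction, the parameter names and the default denominator are hypothetical and not taken from the description.

```python
def leaky_prediction(base_block, enh_block, enh_weight, log2_denom=5):
    """Blend two prediction blocks of equal size sample by sample.

    enh_weight / 2**log2_denom is the share of the enhancement representation;
    the remainder is taken from the base representation.
    """
    denom = 1 << log2_denom
    if not 0 <= enh_weight <= denom:
        raise ValueError("enh_weight must be between 0 and %d" % denom)
    base_weight = denom - enh_weight
    rounding = denom >> 1
    return [
        [(base_weight * b + enh_weight * e + rounding) >> log2_denom
         for b, e in zip(base_row, enh_row)]
        for base_row, enh_row in zip(base_block, enh_block)
    ]


base = [[100, 100], [100, 100]]  # block predicted from the base representation
enh = [[120, 120], [120, 120]]   # block predicted from the enhancement representation
blended = leaky_prediction(base, enh, enh_weight=16)
```

A larger share of the base representation limits drift after pruning at the cost of prediction quality, so an encoder could adapt the weight per slice or per block.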

Asymmetric Scalable Multi-View Coding

In accordance with embodiments of the invention, an encoder, a decoder and a bitstream for scalable asymmetric multi-view video coding may be provided. When the spatial resolution of a decoded picture used as inter-view reference differs from that of the current picture, resampling of the inter-view reference picture or inter-view only reference picture is inferred and performed.
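The inference can be sketched as a size comparison that triggers resampling without any explicit signalling. Nearest-neighbour sub-sampling stands in here for whichever resampling filter is actually used, and the function names are hypothetical.

```python
def resample_nearest(picture, out_width, out_height):
    """Nearest-neighbour resampling, used only as a simple stand-in."""
    in_height = len(picture)
    in_width = len(picture[0])
    return [
        [picture[y * in_height // out_height][x * in_width // out_width]
         for x in range(out_width)]
        for y in range(out_height)
    ]


def inter_view_reference(ref_picture, cur_width, cur_height):
    """Return the reference at the resolution of the current picture.

    Resampling is inferred from the size mismatch alone; a reference that
    already matches is returned unchanged.
    """
    ref_height = len(ref_picture)
    ref_width = len(ref_picture[0])
    if (ref_width, ref_height) == (cur_width, cur_height):
        return ref_picture
    return resample_nearest(ref_picture, cur_width, cur_height)
```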

FIG. 4 shows a system 10 in which various embodiments of the present invention can be utilized, comprising multiple communication devices that can communicate through one or more networks. The system 10 may comprise any combination of wired or wireless networks including, but not limited to, a mobile telephone network, a wireless Local Area Network (LAN), a Bluetooth personal area network, an Ethernet LAN, a token ring LAN, a wide area network, the Internet, etc. The system 10 may include both wired and wireless communication devices.

For exemplification, the system 10 shown in FIG. 4 includes a mobile telephone network 11 and the Internet 28. Connectivity to the Internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and the like.

The exemplary communication devices of the system 10 may include, but are not limited to, an electronic device 12 in the form of a mobile telephone, a combination personal digital assistant (PDA) and mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22, etc. The communication devices may be stationary or mobile as when carried by an individual who is moving. The communication devices may also be located in a mode of transportation including, but not limited to, an automobile, a truck, a taxi, a bus, a train, a boat, an airplane, a bicycle, a motorcycle, etc. Some or all of the communication devices may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the Internet 28. The system 10 may include additional communication devices and communication devices of different types.

The communication devices may communicate using various transmission technologies including, but not limited to, Code Division Multiple Access (CDMA), Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System (UMTS), Time Division Multiple Access (TDMA), Frequency Division Multiple Access (FDMA), Transmission Control Protocol/Internet Protocol (TCP/IP), Short Messaging Service (SMS), Multimedia Messaging Service (MMS), e-mail, Instant Messaging Service (IMS), Bluetooth, IEEE 802.11, etc. A communication device involved in implementing various embodiments of the present invention may communicate using various media including, but not limited to, radio, infrared, laser, cable connection, and the like.

FIGS. 5 and 6 show one representative electronic device 28 which may be used as a network node in accordance with the various embodiments of the present invention. It should be understood, however, that the scope of the present invention is not intended to be limited to one particular type of device. The electronic device 28 of FIGS. 5 and 6 includes a housing 30, a display 32 in the form of a liquid crystal display, a keypad 34, a microphone 36, an ear-piece 38, a battery 40, an infrared port 42, an antenna 44, a smart card 46 in the form of a UICC according to one embodiment, a card reader 48, radio interface circuitry 52, codec circuitry 54, a controller 56 and a memory 58. The electronic device 28 may also include a camera 60. The above described components enable the electronic device 28 to send/receive various messages to/from other devices that may reside on a network in accordance with the various embodiments of the present invention. Individual circuits and elements are all of a type well known in the art, for example in the Nokia range of mobile telephones.

FIG. 7 is a graphical representation of a generic multimedia communication system within which various embodiments may be implemented. As shown in FIG. 7, a data source 100 provides a source signal in an analog, uncompressed digital, or compressed digital format, or any combination of these formats. An encoder 110 encodes the source signal into a coded media bitstream. It should be noted that a bitstream to be decoded can be received directly or indirectly from a remote device located within virtually any type of network. Additionally, the bitstream can be received from local hardware or software. The encoder 110 may be capable of encoding more than one media type, such as audio and video, or more than one encoder 110 may be required to code different media types of the source signal. The encoder 110 may also get synthetically produced input, such as graphics and text, or it may be capable of producing coded bitstreams of synthetic media. In the following, only processing of one coded media bitstream of one media type is considered to simplify the description. It should be noted, however, that typically real-time broadcast services comprise several streams (typically at least one audio, video and text sub-titling stream). It should also be noted that the system may include many encoders, but in FIG. 7 only one encoder 110 is represented to simplify the description without loss of generality. It should be further understood that, although text and examples contained herein may specifically describe an encoding process, one skilled in the art would understand that the same concepts and principles also apply to the corresponding decoding process and vice versa.

The coded media bitstream is transferred to a storage 120. The storage 120 may comprise any type of mass memory to store the coded media bitstream. The format of the coded media bitstream in the storage 120 may be an elementary self-contained bitstream format, or one or more coded media bitstreams may be encapsulated into a container file. If one or more media bitstreams are encapsulated in a container file, a file generator (not shown in the figure) may be used to store the one or more media bitstreams in the file and create file format metadata, which is also stored in the file. The encoder 110 or the storage 120 may comprise the file generator, or the file generator is operationally attached to either the encoder 110 or the storage 120. Some systems operate “live”, i.e. omit storage and transfer the coded media bitstream from the encoder 110 directly to the sender 130. The coded media bitstream is then transferred to the sender 130, also referred to as the server, on a need basis. The format used in the transmission may be an elementary self-contained bitstream format, a packet stream format, or one or more coded media bitstreams may be encapsulated into a container file. The encoder 110, the storage 120, and the server 130 may reside in the same physical device or they may be included in separate devices. The encoder 110 and server 130 may operate with live real-time content, in which case the coded media bitstream is typically not stored permanently, but rather buffered for small periods of time in the content encoder 110 and/or in the server 130 to smooth out variations in processing delay, transfer delay, and coded media bitrate.

The server 130 sends the coded media bitstream using a communication protocol stack. The stack may include but is not limited to Real-Time Transport Protocol (RTP), User Datagram Protocol (UDP), and Internet Protocol (IP). When the communication protocol stack is packet-oriented, the server 130 encapsulates the coded media bitstream into packets. For example, when RTP is used, the server 130 encapsulates the coded media bitstream into RTP packets according to an RTP payload format. Typically, each media type has a dedicated RTP payload format. It should be again noted that a system may contain more than one server 130, but for the sake of simplicity, the following description only considers one server 130.
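The encapsulation step can be sketched as follows. The example prepends the 12-byte fixed RTP header to fixed-size chunks of a bitstream. The fixed-size chunking, the use of the marker bit on the last chunk, and the function and parameter names are simplifying assumptions; an actual RTP payload format defines its own fragmentation and aggregation rules.

```python
import struct

RTP_VERSION = 2


def rtp_packetize(bitstream, payload_type, ssrc, timestamp,
                  first_seq=0, max_payload=1400):
    """Split `bitstream` (bytes) into packets with a fixed RTP header.

    Naive illustration only: chunks are cut at max_payload bytes and the
    marker bit is set on the last chunk.
    """
    chunks = [bitstream[i:i + max_payload]
              for i in range(0, len(bitstream), max_payload)]
    packets = []
    for index, chunk in enumerate(chunks):
        marker = 1 if index == len(chunks) - 1 else 0
        header = struct.pack(
            "!BBHII",
            RTP_VERSION << 6,                  # V=2, no padding/extension, CC=0
            (marker << 7) | (payload_type & 0x7F),
            (first_seq + index) & 0xFFFF,      # sequence number wraps at 16 bits
            timestamp & 0xFFFFFFFF,
            ssrc & 0xFFFFFFFF,
        )
        packets.append(header + chunk)
    return packets
```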

If the media content is encapsulated in a container file for the storage 120 or for inputting the data to the sender 130, the sender 130 may comprise or be operationally attached to a “sending file parser” (not shown in the figure). In particular, if the container file is not transmitted as such but at least one of the contained coded media bitstreams is encapsulated for transport over a communication protocol, a sending file parser locates appropriate parts of the coded media bitstream to be conveyed over the communication protocol. The sending file parser may also help in creating the correct format for the communication protocol, such as packet headers and payloads. The multimedia container file may contain encapsulation instructions, such as hint tracks in the ISO Base Media File Format, for encapsulation of the at least one of the contained media bitstreams on the communication protocol.

The server 130 may or may not be connected to a gateway 140 through a communication network. The gateway 140 may perform different types of functions, such as translation of a packet stream according to one communication protocol stack to another communication protocol stack, merging and forking of data streams, and manipulation of data stream according to the downlink and/or receiver capabilities, such as controlling the bit rate of the forwarded stream according to prevailing downlink network conditions. Examples of gateways 140 include MCUs, gateways between circuit-switched and packet-switched video telephony, Push-to-talk over Cellular (PoC) servers, IP encapsulators in digital video broadcasting-handheld (DVB-H) systems, or set-top boxes that forward broadcast transmissions locally to home wireless networks. When RTP is used, the gateway 140 may be called an RTP mixer or an RTP translator and may act as an endpoint of an RTP connection.

The system includes one or more receivers 150, typically capable of receiving, de-modulating, and de-capsulating the transmitted signal into a coded media bitstream. The coded media bitstream is transferred to a recording storage 155. The recording storage 155 may comprise any type of mass memory to store the coded media bitstream. The recording storage 155 may alternatively or additionally comprise computation memory, such as random access memory. The format of the coded media bitstream in the recording storage 155 may be an elementary self-contained bitstream format, or one or more coded media bitstreams may be encapsulated into a container file. If there are multiple coded media bitstreams, such as an audio stream and a video stream, associated with each other, a container file is typically used and the receiver 150 comprises or is attached to a container file generator producing a container file from input streams. Some systems operate “live,” i.e. omit the recording storage 155 and transfer the coded media bitstream from the receiver 150 directly to the decoder 160. In some systems, only the most recent part of the recorded stream, e.g., the most recent 10-minute excerpt of the recorded stream, is maintained in the recording storage 155, while any earlier recorded data is discarded from the recording storage 155.

The coded media bitstream is transferred from the recording storage 155 to the decoder 160. If there are many coded media bitstreams, such as an audio stream and a video stream, associated with each other and encapsulated into a container file, or a single media bitstream is encapsulated in a container file, e.g. for easier access, a file parser (not shown in the figure) is used to decapsulate each coded media bitstream from the container file. The recording storage 155 or a decoder 160 may comprise the file parser, or the file parser is attached to either the recording storage 155 or the decoder 160.

The coded media bitstream may be processed further by a decoder 160, whose output is one or more uncompressed media streams. Finally, a renderer 170 may reproduce the uncompressed media streams with a loudspeaker or a display, for example. The receiver 150, recording storage 155, decoder 160, and renderer 170 may reside in the same physical device or they may be included in separate devices.

A sender 130 according to various embodiments may be configured to select the transmitted layers for multiple reasons, such as to respond to requests of the receiver 150 or prevailing conditions of the network over which the bitstream is conveyed. A request from the receiver can be, e.g., a request for a change of layers for display or a change of a rendering device having different capabilities compared to the previous one.
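One possible selection rule based on prevailing network conditions is sketched below. The operation-point names, the bitrates and the rule itself (take the highest-bitrate combination of layers that fits the available throughput) are hypothetical and serve only to illustrate the idea.

```python
def select_layers(operation_points, available_kbps):
    """Pick the richest operation point that fits the available bitrate.

    operation_points is a list of (name, kbps) pairs. If nothing fits,
    the cheapest point is returned as a fallback (an assumption of this
    sketch).
    """
    feasible = [op for op in operation_points if op[1] <= available_kbps]
    if not feasible:
        return min(operation_points, key=lambda op: op[1])
    return max(feasible, key=lambda op: op[1])


# Hypothetical operation points for a two-view bitstream.
POINTS = [
    ("base view only", 1500),
    ("base view + low-resolution second view", 2100),
    ("base view + full-resolution second view", 3000),
]
```

With these figures, `select_layers(POINTS, 2500)` would choose the mixed-resolution point, so only the second view is adapted while the base view is left untouched.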

The receiver 150 may comprise a proximity detector or may be able to receive signals from a separate proximity detector to determine the distance of the viewer from the display and/or the position of the head of the viewer. On the basis of this distance determination the receiver 150 may instruct the decoder 160 to change the spatial resolution of one or more of the views to be displayed. In some embodiments, the receiver 150 may communicate with the encoder 110 to inform the encoder 110 that the spatial resolution of one or more of the views can be adapted.

FIG. 12 is a schematic representation of a converter 180 according to an example embodiment of the present invention. The converter 180 may comprise a detector 182 to detect which decoded reference pictures may have to be resampled and whether a drift in sample values of the decoded pictures may be possible, a sampler 184 to resample reference pictures, and a modifier 186 to prune or otherwise modify data units of the view(s).

In one example embodiment the proximity detector is implemented by using a camera of the receiving device and analyzing the image signal from the camera to determine the distance and/or the head position of the viewer.
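The distance-based decision of the receiver 150 might be sketched as follows. The threshold expressed in display heights, its default value and the function name are hypothetical; the description states only that the spatial resolution may be changed on the basis of the determined distance.

```python
def choose_view_resolutions(distance_m, display_height_m, far_ratio=4.0):
    """Return the resolution to request for (base view, non-base view).

    far_ratio is a hypothetical threshold: the viewer counts as "far"
    when the viewing distance exceeds far_ratio display heights.
    """
    if display_height_m <= 0:
        raise ValueError("display_height_m must be positive")
    is_far = distance_m / display_height_m > far_ratio
    return ("full", "reduced") if is_far else ("full", "full")
```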

In the following table some characteristics of some video applications are listed in terms of the availability of a back-channel connection from the recipient to the sender, which can be used to control the encoding and/or the video adaptation for transmission over the communication network, and in terms of the encoding of the video content (live and/or pre-recorded).

TABLE 1. Characteristics of video applications.

Application                              Availability of back-channel from recipient to sender        Encoding
Video telephone, single recipient        Yes                                                          Live, tailored to recipient
Video conferencing, multiple recipients  Yes (might be limited if the number of recipients is large)  Live, typically not tailored to recipient
Unicast streaming                        Yes                                                          Live (typically not tailored to recipient) or pre-recorded
Broadcast/multicast streaming            No or very limited                                           Live (not tailored to recipient) or pre-recorded
File playback                            No                                                           Pre-recorded

When compared to symmetric spatial scalability of multiview video coding, the invention provides a possibility for mixed-resolution stereoscopic video, which may provide a subjective quality close to that of full-resolution stereoscopic video particularly when the viewer is relatively far from the display. Some embodiments of the invention also provide finer granularity in bitrate adaptation, as only one view is required to be adapted at a time.

When compared to non-scalable multiview video coding, some embodiments of the invention facilitate adaptation of bitrate and view resolution at a stage subsequent to encoding. If non-scalable multiview video coding is used to provide similar adaptation functionality to the invention, either of the following options may be used:

Simulcast coding. The base view is encoded at full resolution as an independent bitstream. Two independent bitstreams are coded for the non-base view, one at lower resolution and another at full resolution.

Inter-view predicted coding with both full- and low-resolution non-base view (referred to as IVP coding). The base view is encoded at full resolution. Two versions of the non-base view are coded into the same bitstream also containing the coded base view. One of the coded non-base views is of lower resolution, and the other one is of full resolution. Both views are coded non-scalably. For the coding and decoding of the lower resolution non-base view, reference pictures of the full-resolution base view are resampled and included in the reference picture list of the respective non-base view component.

Various embodiments described herein are described in the general context of method steps or processes, which may be implemented in one embodiment by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable and non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVDs), etc. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.

Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside, for example, on a chipset, a mobile device, a desktop, a laptop or a server. Software and web implementations of various embodiments can be accomplished with standard programming techniques with rule-based logic and other logic to accomplish various database searching steps or processes, correlation steps or processes, comparison steps or processes and decision steps or processes. Various embodiments may also be fully or partially implemented within network elements or modules. It should be noted that the words “component” and “module,” as used herein and in the following claims, are intended to encompass implementations using one or more lines of software code, and/or hardware implementations, and/or equipment for receiving manual inputs.

The foregoing description of embodiments has been presented for purposes of illustration and description. The foregoing description is not intended to be exhaustive or to limit embodiments of the present invention to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of various embodiments. The embodiments discussed herein were chosen and described in order to explain the principles and the nature of various embodiments and their practical application to enable one skilled in the art to utilize the present invention in various embodiments and with various modifications as are suited to the particular use contemplated. The features of the embodiments described herein may be combined in all possible combinations of methods, apparatus, modules, systems, and computer program products.

According to a first embodiment there is provided a method for encoding a first uncompressed picture of a first view and a second uncompressed picture of a second view into a bitstream comprising:

encoding the first uncompressed picture;

reconstructing a first decoded picture on the basis of the encoding of the first uncompressed picture;

resampling at least a part of the first decoded picture into a first resampled decoded picture; and

encoding the second uncompressed picture as a first dependency representation and a second dependency representation,

wherein the first resampled decoded picture is used as a prediction reference for the encoding of the first dependency representation;

the first decoded picture is used as a prediction reference for the encoding of the second dependency representation; and

the first dependency representation is used in the encoding of the second dependency representation.

In some embodiments the method comprises selecting for transmission the first dependency representation or the second dependency representation or both the first and the second dependency representation.

In some embodiments the first view is non-scalably encoded, and the second view is spatially scalably encoded.

In some embodiments a maximum dependency indication value indicative of a number of scalability layers in the second view is included in the bitstream.

In some embodiments a first maximum dependency indication value indicative of a number of scalability layers in the first view is included in the bitstream, and a second maximum dependency indication value indicative of a number of scalability layers in the second view is included in the bitstream.

In some embodiments a spatial resolution of the first uncompressed picture and a spatial resolution of the second uncompressed picture are the same.

In some embodiments a spatial resolution of the first uncompressed picture and a spatial resolution of the first dependency representation are the same.

In some embodiments a spatial resolution of the first dependency representation and a spatial resolution of the second dependency representation are different.

According to a second embodiment there is provided an apparatus comprising:

an encoder configured for encoding the first uncompressed picture of a first view;

a reconstructor configured for reconstructing a first decoded picture on the basis of the encoding of the first uncompressed picture;

a sampler configured for resampling at least a part of the first decoded picture into a first resampled decoded picture; and

said encoder being further configured for

encoding a second uncompressed picture of a second view as a first dependency representation by using the first resampled decoded picture as a prediction reference, and

encoding a second dependency representation by using the first decoded picture as a prediction reference and the first dependency representation in the encoding of the second dependency representation.

In some embodiments the apparatus comprises a selector for selecting for transmission the first dependency representation or the second dependency representation or both the first and the second dependency representation.

In some embodiments the encoder is configured for non-scalably encoding the first view, and for spatially scalably encoding the second view.

In some embodiments the encoder is configured for setting a view_resolution_change_property value indicative of a change in a resolution of the first view or the second view.

In some embodiments the encoder is configured for including a maximum dependency indication value indicative of a number of scalability layers in the second view in the bitstream.

In some embodiments a spatial resolution of the first uncompressed picture and a spatial resolution of the second uncompressed picture are the same.

In some embodiments a spatial resolution of the first uncompressed picture and a spatial resolution of the first dependency representation are the same.

In some embodiments a spatial resolution of the first dependency representation and a spatial resolution of the second dependency representation are different.

According to a third embodiment there is provided an apparatus comprising:

a processor; and

a memory unit operatively connected to the processor and including:

computer code configured to:

encode a first uncompressed picture of a first view;

reconstruct a first decoded picture on the basis of the encoding of the first uncompressed picture;

resample at least a part of the first decoded picture into a first resampled decoded picture; and

encode a second uncompressed picture of a second view as a first dependency representation and a second dependency representation,

wherein the first resampled decoded picture is used as a prediction reference for the encoding of the first dependency representation;

the first decoded picture is used as a prediction reference for the encoding of the second dependency representation; and

the first dependency representation is used in the encoding of the second dependency representation.

According to a fourth embodiment there is provided a method for decoding a multiview video bitstream comprising a first view component of a first view and a second view component of a second view, the method comprising:

decoding the first view component into a first decoded picture;

determining a spatial resolution of the first view component and a spatial resolution of the second view component;

on the basis of the spatial resolution of the first view component being different from the spatial resolution of the second view component:

resampling at least a part of the first decoded picture into a first resampled decoded picture;

decoding the second view component using the first resampled decoded picture as a prediction reference.

In some embodiments the method comprises examining an indication indicative of a change in a spatial resolution of said first view or said second view, and resampling at least a part of the first decoded picture if said indication indicates a change in the spatial resolution.

In some embodiments the method comprises comparing the difference between the spatial resolution of the first view component and the resolution of the second view component, and adjusting said resampling on the basis of the difference between the spatial resolutions.

In some embodiments the bitstream comprises at least two different dependency representations of the second view, each dependency representation provided with a dependency indication, wherein the dependency representation with the highest value for dependency indication is decoded.
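Two of these optional features can be sketched briefly: choosing the dependency representation with the highest dependency indication, and deriving the resampling ratio from the two spatial resolutions. The dictionary layout and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def select_dependency_representation(representations):
    """Return the representation with the highest dependency indication.

    Each representation is modelled as a dict with a "dependency_id" key,
    which is an assumption of this sketch.
    """
    return max(representations, key=lambda rep: rep["dependency_id"])


def resampling_ratio(first_size, second_size):
    """Horizontal and vertical scale factors from first to second view.

    Sizes are (width, height) pairs; a ratio below 1 means down-sampling
    of the first decoded picture, above 1 means up-sampling.
    """
    (first_w, first_h), (second_w, second_h) = first_size, second_size
    return Fraction(second_w, first_w), Fraction(second_h, first_h)
```

The returned ratio could then steer the choice of resampling filter, for example a down-sampling filter when both factors are below 1.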

According to a fifth embodiment there is provided an apparatus comprising:

a decoder configured for decoding a first view component of a first view into a first decoded picture;

a determining element configured for determining a spatial resolution of the first view component being different from a spatial resolution of a second view component of a second view;

a sampler configured for resampling at least a part of the first decoded picture into a first resampled decoded picture when the spatial resolution of the first view component differs from the spatial resolution of the second view component; and

said decoder being further configured for decoding the second view component using the first resampled decoded picture as a prediction reference.

In some embodiments the apparatus comprises an examining element configured for examining an indication indicative of a change in a spatial resolution of said first view or said second view, wherein said sampler is configured for resampling at least a part of the first decoded picture if said indication indicates a change in the spatial resolution.

In some embodiments the apparatus comprises a comparator configured for comparing the difference between the spatial resolution of the first view component and the resolution of the second view component, wherein said sampler is configured for adjusting said resampling on the basis of the difference between the spatial resolutions.

In some embodiments the bitstream comprises at least two different dependency representations of the second view, each dependency representation provided with a dependency indication, wherein the decoder is configured for decoding the dependency representation with the highest value for dependency indication.

According to a sixth embodiment there is provided an apparatus comprising:

a processor; and

a memory unit operatively connected to the processor and including

computer code configured to:

decode a first view component of a first view into a first decoded picture;

determine a spatial resolution of the first view component and a spatial resolution of a second view component of a second view;

on the basis of the spatial resolution of the first view component being different from the spatial resolution of the second view component:

resample at least a part of the first decoded picture into a first resampled decoded picture;

decode the second view component using the first resampled decoded picture as a prediction reference.

According to a seventh embodiment there is provided a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform:

encode a first uncompressed picture of a first view;

reconstruct a first decoded picture on the basis of the encoding of the first uncompressed picture;

resample at least a part of the first decoded picture into a first resampled decoded picture; and

encode a second uncompressed picture of a second view as a first dependency representation and a second dependency representation,

wherein the code, which when executed by a processor, further causes the apparatus to:

use the first resampled decoded picture as a prediction reference for the encoding of the first dependency representation;

use the first decoded picture as a prediction reference for the encoding of the second dependency representation; and

use the first dependency representation in the encoding of the second dependency representation.

According to an eighth embodiment there is provided a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform:

decode a first view component of a first view into a first decoded picture;

determine a spatial resolution of the first view component and a spatial resolution of a second view component of a second view;

on the basis of the spatial resolution of the first view component being different from the spatial resolution of the second view component:

resample at least a part of the first decoded picture into a first resampled decoded picture;

decode the second view component using the first resampled decoded picture as a prediction reference.

According to a ninth embodiment there is provided at least one processor and at least one memory, said at least one memory stored with code thereon, which when executed by said at least one processor, causes an apparatus to perform:

encode a first uncompressed picture of a first view;

reconstruct a first decoded picture on the basis of the encoding of the first uncompressed picture;

resample at least a part of the first decoded picture into a first resampled decoded picture; and

encode a second uncompressed picture of a second view as a first dependency representation and a second dependency representation,

wherein the code, which when executed by a processor, further causes the apparatus to:

use the first resampled decoded picture as a prediction reference for the encoding of the first dependency representation;

use the first decoded picture as a prediction reference for the encoding of the second dependency representation; and

use the first dependency representation in the encoding of the second dependency representation.

According to a tenth embodiment there is provided at least one processor and at least one memory, said at least one memory stored with code thereon, which when executed by said at least one processor, causes an apparatus to perform:

decode a first view component of a first view into a first decoded picture;

determine a spatial resolution of the first view component and a spatial resolution of a second view component of a second view;

on the basis of the spatial resolution of the first view component being different from the spatial resolution of the second view component:

resample at least a part of the first decoded picture into a first resampled decoded picture;

decode the second view component using the first resampled decoded picture as a prediction reference.

According to an eleventh embodiment there is provided an apparatus comprising:

means for encoding a first uncompressed picture of a first view;

means for reconstructing a first decoded picture on the basis of the encoding of the first uncompressed picture;

means for resampling at least a part of the first decoded picture into a first resampled decoded picture; and

means for encoding a second uncompressed picture of a second view as a first dependency representation by using the first resampled decoded picture as a prediction reference, and

means for encoding a second dependency representation by using the first decoded picture as a prediction reference and the first dependency representation in the encoding of the second dependency representation.

According to a twelfth embodiment there is provided an apparatus comprising:

means for decoding a first view component of a first view into a first decoded picture;

means for determining a spatial resolution of the first view component being different from a spatial resolution of a second view component of a second view;

means for resampling at least a part of the first decoded picture into a first resampled decoded picture when the spatial resolution of the first view component differs from the spatial resolution of the second view component; and

means for decoding the second view component using the first resampled decoded picture as a prediction reference.

1. A method for encoding a first uncompressed picture of a first view and a second uncompressed picture of a second view into a bitstream comprising: encoding the first uncompressed picture; reconstructing a first decoded picture on the basis of the encoding of the first uncompressed picture; resampling at least a part of the first decoded picture into a first resampled decoded picture; and encoding the second uncompressed picture as a first dependency representation and a second dependency representation, wherein the first resampled decoded picture is used as prediction reference for the encoding of the first dependency representation; the first decoded picture is used as prediction reference for the encoding of the second dependency representation; and the first dependency representation is used in the encoding of the second dependency representation.
2. The method according to claim 1 further comprising selecting for transmission the first dependency representation or the second dependency representation or both the first and the second dependency representation.
3. The method according to claim 1 comprising non-scalably encoding the first view, and spatially scalably encoding the second view.
4. The method according to claim 1 further comprising including in the bitstream a first maximum dependency indication value indicative of a number of scalability layers in the first view; and including in the bitstream a second maximum dependency indication value indicative of a number of scalability layers in the second view.
5. The method according to claim 1, wherein the first resampled decoded picture is used as prediction reference for the encoding of the first dependency representation in inter-view prediction; and the first decoded picture is used as prediction reference for the encoding of the second dependency representation in inter-view prediction.
6. The method according to claim 1, wherein the first dependency representation is used in the encoding of the second dependency representation through an inter-layer prediction mechanism.
7. An apparatus comprising: an encoder configured for encoding the first uncompressed picture of a first view; a reconstructor configured for reconstructing a first decoded picture on the basis of the encoding of the first uncompressed picture; a sampler configured for resampling at least a part of the first decoded picture into a first resampled decoded picture; and said encoder being further configured for encoding a second uncompressed picture of a second view as a first dependency representation by using the first resampled decoded picture as prediction reference, and encoding a second dependency representation by using the first decoded picture as prediction reference and the first dependency representation in the encoding of the second dependency representation.
8. The apparatus according to claim 7, further comprising a selector for selecting for transmission the first dependency representation or the second dependency representation or both the first and the second dependency representation.
9. The apparatus according to claim 7, wherein the encoder is configured for non-scalably encoding the first view, and for spatially scalably encoding the second view.
10. The apparatus according to claim 7, wherein the encoder is configured for using the first resampled decoded picture as prediction reference for the encoding of the first dependency representation in inter-view prediction; and using the first decoded picture as prediction reference for the encoding of the second dependency representation in inter-view prediction.
11. The apparatus according to claim 7, wherein the encoder is configured for using the first dependency representation in the encoding of the second dependency representation through an inter-layer prediction mechanism.
12. An apparatus comprising: a processor; and a memory unit operatively connected to the processor and including: computer code configured to: encode a first uncompressed picture of a first view; reconstruct a first decoded picture on the basis of the encoding of the first uncompressed picture; resample at least a part of the first decoded picture into a first resampled decoded picture; and encode a second uncompressed picture of a second view as a first dependency representation and a second dependency representation, wherein the first resampled decoded picture is used as prediction reference for the encoding of the first dependency representation; the first decoded picture is used as prediction reference for the encoding of the second dependency representation; and the first dependency representation is used in the encoding of the second dependency representation.
13. A method for decoding a multiview video bitstream comprising a first view component of a first view and a second view component of a second view, the method comprising: decoding the first view component into a first decoded picture; determining a spatial resolution of the first view component and a spatial resolution of the second view component; on the basis of the spatial resolution of the first view component being different from the spatial resolution of the second view component: resampling at least a part of the first decoded picture into a first resampled decoded picture; decoding the second view component using the first resampled decoded picture as prediction reference.
14. The method according to claim 13 further comprising examining an indication indicative of a change in a spatial resolution of said first view or said second view, and resampling at least a part of the first decoded picture if said indication indicates a change in the spatial resolution.
15. The method according to claim 13 further comprising comparing the difference between the spatial resolution of the first view component and the resolution of the second view component, and adjusting said resampling on the basis of the difference between the spatial resolutions.
16. The method according to claim 15, wherein the bitstream comprises a maximum dependency indication value; and wherein the method further comprises determining that the spatial resolution of the first view component is different from the spatial resolution of the second view component when the highest dependency indication of the second view is less than the maximum dependency indication value.
17. The method according to claim 13, wherein the second view component comprises at least one dependency representation, each of the at least one dependency representation comprises a dependency indication, wherein the method further comprises decoding a dependency representation with the highest value for dependency indication.
18. An apparatus comprising: a decoder configured for decoding a first view component of a first view into a first decoded picture; a determining element configured for determining a spatial resolution of the first view component being different from a spatial resolution of a second view component of a second view; a sampler configured for resampling at least a part of the first decoded picture into a first resampled decoded picture when the spatial resolution of the first view component differs from the spatial resolution of the second view component; and said decoder being further configured for decoding the second view component using the first resampled decoded picture as prediction reference.
19. The apparatus according to claim 18 further comprising an examining element configured for examining an indication indicative of a change in a spatial resolution of said first view or said second view, wherein said sampler is configured for resampling at least a part of the first decoded picture if said indication indicates a change in the spatial resolution.
20. The apparatus according to claim 18 further comprising a comparator configured for comparing the difference between the spatial resolution of the first view component and the resolution of the second view component, wherein said sampler is configured for adjusting said resampling on the basis of the difference between the spatial resolutions.
21. The apparatus according to claim 18, wherein the bitstream comprises at least two different dependency representations of the second view, each dependency representation provided with a dependency indication, wherein the decoder is configured for decoding the dependency representation with the highest value for dependency indication.
22. An apparatus comprising: a processor; and a memory unit operatively connected to the processor and including computer code configured to: decode a first view component of a first view into a first decoded picture; determine a spatial resolution of the first view component and a spatial resolution of a second view component of a second view; on the basis of the spatial resolution of the first view component being different from the spatial resolution of the second view component: resample at least a part of the first decoded picture into a first resampled decoded picture; decode the second view component using the first resampled decoded picture as prediction reference.
23. A computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform: encode a first uncompressed picture of a first view; reconstruct a first decoded picture on the basis of the encoding of the first uncompressed picture; resample at least a part of the first decoded picture into a first resampled decoded picture; and encode a second uncompressed picture of a second view as a first dependency representation and a second dependency representation, wherein the code, which when executed by a processor, further causes the apparatus to: use the first resampled decoded picture as prediction reference for the encoding of the first dependency representation; use the first decoded picture as prediction reference for the encoding of the second dependency representation; and use the first dependency representation in the encoding of the second dependency representation.
24. A computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform: decode a first view component of a first view into a first decoded picture; determine a spatial resolution of the first view component and a spatial resolution of a second view component of a second view; on the basis of the spatial resolution of the first view component being different from the spatial resolution of the second view component: resample at least a part of the first decoded picture into a first resampled decoded picture; decode the second view component using the first resampled decoded picture as prediction reference.
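
The use of dependency indications recited in claims 16, 17 and 21 can be illustrated with a short sketch. The class `DependencyRepresentation`, its field names, and the two helper functions are hypothetical and are not taken from any standard or reference decoder; the sketch only assumes that each dependency representation carries an integer dependency indication and that the bitstream signals a maximum dependency indication value.

```python
from dataclasses import dataclass


@dataclass
class DependencyRepresentation:
    dependency_id: int  # hypothetical name for the dependency indication
    payload: bytes      # coded data, opaque in this sketch


def select_for_decoding(reps):
    """Pick the dependency representation with the highest indication."""
    return max(reps, key=lambda r: r.dependency_id)


def resolutions_differ(second_view_reps, max_dependency_id):
    """Infer a resolution difference between the views.

    The second view is taken to have a different spatial resolution when
    its highest received dependency indication is below the maximum
    dependency indication value signalled in the bitstream.
    """
    highest = max(r.dependency_id for r in second_view_reps)
    return highest < max_dependency_id


# Usage: only the base representation of the second view was received.
received = [DependencyRepresentation(0, b"base")]
needs_resampling = resolutions_differ(received, max_dependency_id=1)
```

Under these assumptions a decoder would resample the first decoded picture before inter-view prediction whenever `needs_resampling` is true.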